Creating and sharing knowledge for telecommunications

Project: Unified Transcription and Translation for Extended Reality

Acronym: UTTER
Main Objective:
The research in UTTER is motivated by the following two use-cases:
- Virtual Assistant for Online Multilingual Meetings: The assistant should be able to translate from the speaker's language into the language of the listener, produce a summary of the meeting, and note action points (minuting).
- Multilingual Customer Service Dialogue Tool: The tool should enable a customer service agent to provide support to global users (by text or voice) in cases where the customer and agent speak different languages, using the context of the conversation to assist the agent, with guidance to provide helpful, personalized answers that take into account the formality level and the cultural context of the customer and the brand, as well as monitoring the satisfaction and engagement of the customer.

We tackle these use-cases by extending the state-of-the-art in translation and summarisation in the following ways:
- Translation should be multimodal, i.e. as strong for speech input as it is for text input.
- All language technologies should be multilingual – in this project we will cover 6 languages: English, French, German, Portuguese, Dutch, and Korean.
- Dialogue generation and translation should take into account the context of the conversation and its history, as well as other forms of context, such as the meeting notes, desired politeness level, and the speakers’ emotional status (e.g. the sentiment of the customer).
- Translation of speech should be able to take into account paralinguistic aspects such as intonation, as this can often change the meaning of the utterance, and should track the identity of the speaker.
- A summary of a meeting should include the action points generated by the meeting (i.e. the minutes).
- Summarisation and minuting should be explainable; in other words, everything included in the meeting summary should be traceable back to the content of the meeting.
- Translation and summarisation should be efficient; in particular, speech translation should be real-time.
- Systems should be robust and confidence-aware: they should be resilient to typos, acoustic noise, and recognition errors, and they should be able to report their uncertainty.

We will achieve these through the use of pre-trained XR models, but in a fully open pipeline, where questions on bias, fairness, risk, etc. can be examined.
Reference: Horizon Europe, contract 101070631
Funding: EU
Start Date: 01-10-2022
End Date: 30-09-2025
Team: André Filipe Torres Martins, Gabriel Falcao Paiva Fernandes, Chrysoula Zerva
Groups: Pattern and Image Analysis – Lx, Multimedia Signal Processing – Co
Partners: University of Amsterdam, University of Edinburgh, Naver Labs, Unbabel, Instituto de Telecomunicacoes
Local Coordinator: André Filipe Torres Martins

Associated Publications