The growing need for instant spoken language transcription and translation is
driven by increased global communication and cross-lingual interactions. This
has made offering translations in multiple languages essential for user
applications. Traditional approaches to automatic speech recognition (ASR) and
speech translation (ST) have often relied on separate systems, leading to
inefficient use of computational resources and increased synchronization
complexity in real time. In this paper, we propose a streaming
Transformer-Transducer (T-T) model able to jointly produce many-to-one and
one-to-many transcription and translation using a single decoder. We introduce
a novel method for joint token-level serialized output training based on
timestamp information to effectively produce ASR and ST outputs in the
streaming setting. Experiments on {it,es,de}->en demonstrate the effectiveness of our
approach, enabling the generation of one-to-many joint outputs with a single
decoder for the first time.
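
As a rough illustration of the timestamp-based token-level serialization described above, the sketch below interleaves ASR and ST tokens into one target sequence ordered by emission time. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `serialize_by_timestamp`, the task tags `<asr>` and `<st>`, the tie-breaking rule, and the example tokens and timestamps are all hypothetical.

```python
from typing import List, Tuple

# (token text, timestamp in seconds); timestamps are assumed to be
# non-decreasing within each stream.
Token = Tuple[str, float]


def serialize_by_timestamp(
    asr: List[Token],
    st: List[Token],
    asr_tag: str = "<asr>",  # hypothetical task tag
    st_tag: str = "<st>",    # hypothetical task tag
) -> List[str]:
    """Interleave ASR and ST tokens by timestamp into one sequence.

    A task tag is emitted whenever the output switches between streams.
    Ties on the timestamp go to the ASR token (an assumption made here).
    """
    tagged = [(t, 0, i, tok, asr_tag) for i, (tok, t) in enumerate(asr)]
    tagged += [(t, 1, i, tok, st_tag) for i, (tok, t) in enumerate(st)]
    # Sort by (timestamp, stream id, position within the stream).
    tagged.sort(key=lambda item: item[:3])

    output: List[str] = []
    current_tag = None
    for _, _, _, tok, tag in tagged:
        if tag != current_tag:
            output.append(tag)
            current_tag = tag
        output.append(tok)
    return output


# Made-up example: Italian transcription and English translation tokens.
asr_tokens = [("ciao", 0.4), ("mondo", 0.9)]
st_tokens = [("hello", 0.6), ("world", 1.1)]
serialized = serialize_by_timestamp(asr_tokens, st_tokens)
```

With a target built this way, a single decoder could be trained to emit both outputs in one left-to-right pass, which is the property the abstract highlights for the streaming setting.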

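This paper proposes a streaming Transformer-Transducer (T-T) model that jointly produces many-to-one and one-to-many transcription and translation with a single decoder, and introduces a novel timestamp-based method for effectively producing ASR and ST outputs in the streaming setting. Experiments on {it, es, de}->en demonstrate the effectiveness of the approach, which enables one-to-many joint outputs with a single decoder for the first time.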

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

This paper presents a new approach for end-to-end audio-visual multi-talker
speech recognition. The approach, referred to here as the visual context
attention model (VCAM), is important because it uses the available video
information to assign decoded text to one of multiple visible faces. This
essentially resolves the label ambiguity issue associated with most
multi-talker modeling approaches, which can decode multiple label strings but
cannot assign them to the correct speakers. The approach is implemented as
a transformer-transducer based end-to-end model and evaluated on a
two-speaker audio-visual overlapping speech dataset created from YouTube videos. It
is shown in the paper that the VCAM model improves performance with respect to
previously reported audio-only and audio-visual multi-talker ASR systems.

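This paper presents a new end-to-end audio-visual multi-talker speech recognition approach, the visual context attention model (VCAM), which uses the available video information to assign decoded text to one of multiple visible faces and thereby resolves the label ambiguity issue of multi-talker modeling approaches. It is implemented as a Transformer-Transducer based end-to-end model and evaluated on a two-speaker audio-visual overlapping speech dataset created from YouTube videos, showing that VCAM improves performance over previously reported audio-only and audio-visual multi-talker ASR systems.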

End-to-end multi-talker audio-visual ASR using an active speaker attention module

Recently, several end-to-end speech recognition methods known as
transformer-transducers have been introduced. In these methods,
transcription networks are generally modeled by transformer-based neural
networks, while prediction networks can be modeled by either transformers or
recurrent neural networks (RNNs). This paper explores multitask learning, joint
optimization, and joint decoding methods for transformer-RNN-transducer
systems. The main advantage of the proposed methods is that the model can
retain information from a large text corpus. We demonstrate their effectiveness
through experiments with the ESPnet toolkit on the widely
used LibriSpeech datasets. We also show that the proposed methods can reduce
word error rate (WER) by 16.6% and 13.3% for the test-clean and test-other
datasets, respectively, without changing the overall model structure or
exploiting an external language model (LM).
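
For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring code used in the paper; the function name `word_error_rate` and the example sentences are illustrative assumptions. The abstract does not state whether the reported reductions are relative or absolute, so no such calculation is attempted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```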

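This paper explores multitask learning, joint optimization, and joint decoding methods for transformer-RNN-transducer systems, showing that these methods let the model retain information from a large text corpus and effectively reduce word error rate.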