This paper presents a new approach for end-to-end audio-visual multi-talker
speech recognition. The approach, referred to here as the visual context
attention model (VCAM), is important because it uses the available video
information to assign decoded text to one of multiple visible faces. This
essentially resolves the label ambiguity issue associated with most
multi-talker modeling approaches, which can decode multiple label strings but
cannot assign them to the correct speakers. VCAM is implemented as
a transformer-transducer-based end-to-end model and evaluated on a
two-speaker audio-visual overlapping speech dataset created from YouTube videos.
The model is shown to improve performance over
previously reported audio-only and audio-visual multi-talker ASR systems.
