The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem: unconstrained natural language sentences
and in-the-wild videos. Our key contributions are: (1) we compare two models
for lip reading, one using a CTC loss, and the other using a
sequence-to-sequence loss. Both models are built on top of the transformer
self-attention architecture; (2) we investigate to what extent lip reading is
complementary to audio speech recognition, especially when the audio signal is
noisy; (3) we introduce and publicly release a new dataset for audio-visual
speech recognition, LRS2-BBC, consisting of thousands of natural sentences from
British television. The models that we train surpass the performance of all
previous work on a lip reading benchmark dataset by a significant margin.

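This paper aims to recognise the phrases and sentences spoken by a talking
face, with or without audio. We propose two lip reading models built on
self-attention, one using CTC and one using sequence-to-sequence, and study
how complementary lip reading is to audio recognition under noisy conditions.
We also introduce and publicly release LRS2-BBC, a new dataset of thousands of
natural sentences from British television. The models we build outperform
previous related work in our experiments.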