Visual speech recognition models extract visual features in a hierarchical
manner. At the lower level, there is a visual front-end with a limited temporal
receptive field that processes the raw pixels depicting the lips or faces. At
the higher level, there is an encoder that attends to the embeddings produced
by the front-end over a large temporal receptive field. Previous work has
focused on improving the visual front-end of the model to extract more useful
features for speech recognition. Surprisingly, our work shows that complex
visual front-ends are not necessary. Instead of allocating resources to a
sophisticated visual front-end, we find that a linear visual front-end paired
with a larger Conformer encoder yields lower latency, more efficient memory
usage, and a lower word error rate (WER). We achieve a new state-of-the-art of
$12.8\%$ WER for visual speech recognition on the TED LRS3 dataset, which
rivals the performance of audio-only models from just four years ago.

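Summary: The paper proposes pairing a linear visual front-end with a larger Conformer encoder, which yields lower latency, more efficient memory usage, and a lower WER, and sets a new state of the art for visual speech recognition on the TED LRS3 dataset.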
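
To make the idea of a linear visual front-end concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each video frame is flattened into a single pixel vector and mapped to an embedding by one affine projection; the function name `linear_front_end`, the toy shapes, and the random weights are all hypothetical. In this sketch every frame is projected independently, so all temporal modelling would be left to the encoder that consumes the embeddings.

```python
import random


def linear_front_end(frames, weights, bias):
    """Map each frame to an embedding with one affine projection.

    frames:  list of T frames, each a list of H rows of W pixel values
    weights: D rows, each with H * W coefficients (hypothetical parameters)
    bias:    D offsets
    Returns a list of T embeddings, each of length D.
    """
    embeddings = []
    for frame in frames:
        # Flatten the H x W frame into a single pixel vector.
        pixels = [p for row in frame for p in row]
        # One dot product per output dimension: e = W x + b.
        embeddings.append([
            sum(w * p for w, p in zip(w_row, pixels)) + b
            for w_row, b in zip(weights, bias)
        ])
    return embeddings


# Toy example: 4 frames of 3 x 3 pixels projected to 2-dimensional embeddings.
rng = random.Random(0)
T, H, W, D = 4, 3, 3, 2
frames = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(H * W)] for _ in range(D)]
bias = [0.0] * D
embeddings = linear_front_end(frames, weights, bias)
```

The resulting sequence of per-frame embeddings is what a sequence encoder such as a Conformer would take as input; the encoder itself is omitted here.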