In visual speech processing, context modeling capability is one of the most important requirements because lip movements are inherently ambiguous. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize context modeling ability by drawing on the power of LLMs. Specifically, VSP-LLM is designed to perform both visual speech recognition and visual speech translation, where the given instruction controls the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Based on the observation that input frames contain redundant information, we propose a novel deduplication method that reduces the number of embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptors (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 15 hours of labeled data recognizes and translates lip movements more effectively than a recent translation model trained on 433 hours of labeled data.
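The deduplication step can be illustrated with a minimal sketch. The abstract does not specify how features are merged, so this assumes one plausible reading: each frame carries a discrete visual speech unit id, and consecutive frames that share the same id are collapsed into one feature by averaging. The function name `deduplicate`, the averaging rule, and the toy inputs are all hypothetical and are not taken from the paper.

```python
from itertools import groupby


def deduplicate(features, units):
    """Collapse runs of consecutive frames that share a unit id.

    Assumption (not from the paper): each run is replaced by the
    element-wise mean of its frame features.
    """
    if len(features) != len(units):
        raise ValueError("features and units must have the same length")

    reduced = []
    start = 0
    # groupby yields one group per run of identical consecutive unit ids
    for _unit, group in groupby(units):
        run = len(list(group))
        chunk = features[start:start + run]
        dim = len(chunk[0])
        # Average the run's features dimension by dimension
        reduced.append([sum(f[d] for f in chunk) / run for d in range(dim)])
        start += run
    return reduced


# Toy example: 5 frames with 2-dimensional features and hypothetical unit ids
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
units = [7, 7, 7, 2, 2]
reduced = deduplicate(features, units)
print(reduced)  # [[3.0, 4.0], [8.0, 9.0]]
```

In this toy case five frame features shrink to two, so the sequence passed to the LLM is shorter whenever neighboring frames map to the same unit.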

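This paper proposes a new framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), which maximizes context modeling ability by drawing on the power of LLMs. On the MuAViC benchmark, experiments show that VSP-LLM, using only 15 hours of labeled data, recognizes and translates lip movements more effectively than a recent translation model trained on 433 hours of labeled data.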

Where Visual Speech Meets Language: An Efficient and Context-Aware Visual Speech Processing Framework (VSP-LLM)