Advances in technology have led to the use of multimodal systems in
various real-world applications. Among them, audio-visual systems are some
of the most widely used. In recent years, associating the face
and voice of a person has gained attention because of the unique
correlation between them. The Face-voice Association in Multilingual
Environments (FAME) Challenge 2024 focuses on exploring face-voice association
under the unique condition of a multilingual scenario. This condition is inspired
by the fact that half of the world's population is bilingual and people
often communicate in multilingual scenarios. The challenge uses the
Multilingual Audio-Visual (MAV-Celeb) dataset to explore face-voice
association in multilingual environments. This report describes the
challenge, dataset, baselines, and tasks of the FAME Challenge.

Face-voice association in multilingual environments is the topic explored by the FAME Challenge 2024, which uses the Multilingual Audio-Visual (MAV-Celeb) dataset for research and evaluation.

Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

In this work, we propose a technique to transfer speech recognition
capabilities from audio speech recognition systems to visual speech
recognizers, with the goal of using audio data during lipreading model
training. Audio and audio-visual systems have shown impressive progress in
speech recognition. Nevertheless, there is still much
to be explored with regard to visual speech recognition systems because of the
visual ambiguity of some phonemes. The development of visual
speech recognition models is therefore crucial, given the instability of audio
models. The main contributions of this work are i) building on recent state-of-the-art
word-based lipreading models by integrating sequence-level and frame-level
Knowledge Distillation (KD) into them; ii) leveraging audio data while
training visual models, which has not been exploited in prior word-based
work; iii) proposing Gaussian-shaped averaging in frame-level KD as an
efficient technique that helps the model distill knowledge at the sequence
model encoder. This work proposes a novel and competitive architecture for
lipreading, and we demonstrate a noticeable improvement in performance, setting
a new benchmark of 88.64% on the LRW dataset.
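
The abstract does not spell out how the two distillation terms are computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the audio teacher produces more frames than the visual student, that "Gaussian-shaped averaging" means pooling teacher frames with normalized Gaussian weights centered on each student frame, that the frame-level loss is a mean squared error, and that the sequence-level loss is a KL divergence between softened word distributions. All function names, the sigma parameter, and the temperature parameter are hypothetical.

```python
import math


def gaussian_weights(num_audio, center, sigma):
    """Normalized Gaussian weights over audio frame indices, peaked at `center`."""
    raw = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_audio)]
    total = sum(raw)
    return [w / total for w in raw]


def gaussian_pool(audio_feats, num_video, sigma=1.0):
    """Resample teacher features (T_a x D) to T_v frames by Gaussian-weighted averaging."""
    num_audio = len(audio_feats)
    dim = len(audio_feats[0])
    pooled = []
    for t in range(num_video):
        # Map video frame t to its position on the audio time axis (assumed alignment).
        center = (t + 0.5) * num_audio / num_video - 0.5
        w = gaussian_weights(num_audio, center, sigma)
        pooled.append(
            [sum(w[i] * audio_feats[i][d] for i in range(num_audio)) for d in range(dim)]
        )
    return pooled


def frame_level_kd_loss(student_feats, teacher_feats, sigma=1.0):
    """Assumed frame-level KD: MSE between student frames and pooled teacher frames."""
    target = gaussian_pool(teacher_feats, len(student_feats), sigma)
    sq = [
        (s - t) ** 2
        for s_row, t_row in zip(student_feats, target)
        for s, t in zip(s_row, t_row)
    ]
    return sum(sq) / len(sq)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    m = max(logits)
    exps = [math.exp((v - m) / temperature) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_level_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Assumed sequence-level KD: KL(teacher || student) over word-class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a training loop, these two terms would typically be added, with tunable weights, to the usual cross-entropy loss on the word labels; the weighting actually used in this work is not stated in the abstract.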

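This paper proposes a method for transferring capabilities from audio speech recognition systems to visual speech recognizers, with the goal of using audio data during lipreading model training.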