Audio-visual automatic speech recognition (AV-ASR) models are very effective
at reducing word error rates on noisy speech, but require large amounts of
transcribed AV training data. Recently, audio-visual self-supervised learning
(SSL) approaches have been developed to reduce this dependence on transcribed
AV data, but these methods are quite complex and computationally expensive. In
this work, we propose replacing these expensive AV-SSL methods with a simple
and fast audio-only SSL method, and then performing AV supervised
fine-tuning. We show that this approach is competitive with state-of-the-art
(SOTA) AV-SSL methods on the LRS3-TED benchmark task (within 0.5% absolute
WER), while being dramatically simpler and more efficient (12-30x faster to
pre-train). Furthermore, we show we can extend this approach to convert a SOTA
audio-only ASR model into an AV model. By doing so, we match SOTA AV-SSL
results, even though no AV data was used during pre-training.

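Pre-training with a simple, fast audio-only self-supervised learning method and then performing supervised audio-visual fine-tuning is competitive with state-of-the-art audio-visual self-supervised learning methods, reduces the dependence on large amounts of transcribed data, and is more efficient and faster to pre-train.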

Audio-visual fine-tuning of audio-only ASR models

This report presents the technical details of our submission to the EGO4D
Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the
OxfordVGG team. We present WhisperX, a system for efficient speech
transcription of long-form audio with word-level time alignment, along with two
publicly available text normalisers. Our final submission obtained a Word
Error Rate (WER) of 56.0% on the challenge test set, ranking 1st on the
leaderboard. All baseline code and models are available at
this https URL
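
Since the result above is reported as a WER computed after text normalisation, a minimal sketch of that metric may help. It assumes the standard definition (word-level edit distance divided by the number of reference words). The `normalise` helper is a hypothetical toy stand-in, not one of the normalisers released with the submission.

```python
def normalise(text: str) -> str:
    """Toy normaliser (an assumption, not the challenge normaliser):
    lowercase, drop punctuation except apostrophes, collapse whitespace."""
    kept = "".join(
        c for c in text.lower() if c.isalnum() or c.isspace() or c == "'"
    )
    return " ".join(kept.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = normalise("The cat sat on the mat.")
    hyp = normalise("the cat sit on mat")
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, and the choice of normaliser can change the reported figure noticeably.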

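This report describes the technical details of the OxfordVGG team's submission to the EGO4D Audio-Visual Automatic Speech Recognition Challenge 2023. It presents WhisperX, a system for efficient transcription of long-form audio with word-level time alignment, together with two publicly available text normalisers. The final submission obtained a word error rate (WER) of 56.0% on the challenge test set and ranked first on the leaderboard. The report also links to all baseline code and models.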

OxfordVGG Submission to the EGO4D AV Transcription Challenge

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition
by introducing the video modality as an additional source of information. In
this work, the information contained in the motion of the speaker's mouth is
used to augment the audio features. The video modality is traditionally
processed with a 3D convolutional neural network (e.g. a 3D version of VGG).
Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability
to extract rich visual features for image classification tasks. Here, we
propose to replace the 3D convolution with a video transformer to extract
visual features. We train our baselines and the proposed model on a large-scale
corpus of YouTube videos. The performance of our approach is evaluated on a
labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our
best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED,
relative improvements of 10% and 15%, respectively, over our convolutional
baseline. After fine-tuning our model, we achieve state-of-the-art audio-visual
recognition performance on LRS3-TED (1.6% WER). In addition, in a series of
experiments on multi-person AV-ASR, we obtained an average relative reduction
of 2% over our convolutional video frontend.
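
The abstract mixes absolute WER figures with relative improvements, so a short sketch of the arithmetic may be useful. It assumes the usual definition of relative reduction, (baseline - new) / baseline. The baseline values it derives are only implied by the rounded numbers quoted above and are not figures reported by the authors. The function names are hypothetical.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, relative: float) -> float:
    """Baseline WER implied by a new WER and a relative reduction.

    Inverts relative_reduction; the result is a rough estimate when the
    inputs are rounded, as they are in an abstract.
    """
    if not 0 <= relative < 1:
        raise ValueError("relative must be in [0, 1)")
    return new_wer / (1.0 - relative)


if __name__ == "__main__":
    # WER and relative improvement as quoted in the abstract (rounded).
    quoted = [("YTDEV18", 31.4, 0.10), ("LRS3-TED", 17.0, 0.15)]
    for corpus, wer, rel in quoted:
        base = implied_baseline(wer, rel)
        print(f"{corpus}: implied baseline of roughly {base:.1f}% WER")
```

A relative figure on its own says little about the absolute gap: the same 10% relative improvement corresponds to a much larger absolute change at 30% WER than at 2% WER.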

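This paper proposes replacing 3D convolution with a video transformer for visual feature extraction to improve audio-visual automatic speech recognition, and evaluates the approach on a large-scale YouTube video corpus and the LRS3-TED public corpus. The method achieves state-of-the-art performance on LRS3-TED. In multi-person audio-visual speech recognition, it obtains an average relative reduction of 2% over the 3D convolutional front-end.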

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

In this paper, we present methods in deep multimodal learning for fusing
speech and visual modalities for Audio-Visual Automatic Speech Recognition
(AV-ASR). First, we study an approach where uni-modal deep networks are trained
separately and their final hidden layers are fused to obtain a joint feature
space in which another deep network is built. While the audio network alone
achieves a phone error rate (PER) of $41\%$ under clean conditions on the IBM
large vocabulary audio-visual studio dataset, this fusion model achieves a PER
of $35.83\%$, demonstrating the tremendous value of the visual channel in phone
classification even in audio with a high signal-to-noise ratio. Second, we
present a new deep network architecture that uses a bilinear softmax layer to
account for class-specific correlations between modalities. We show that
combining the posteriors from the bilinear networks with those from the fused
model mentioned above results in a further significant phone error rate
reduction, yielding a final PER of $34.03\%$.
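
The three ingredients named above (fusing final hidden layers, a bilinear softmax, and combining posteriors) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation. The abstract gives no formulas, so the sketch assumes fusion by concatenation, a bilinear logit of the form a^T W_k v for each class k, and log-linear interpolation of posteriors. All function names and the `weight` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_hidden(audio_hidden, video_hidden):
    """Assumed fusion: concatenate the two final hidden layers into one
    joint feature vector, on which a further network would be trained."""
    return list(audio_hidden) + list(video_hidden)


def bilinear_softmax(audio_hidden, video_hidden, weights):
    """Assumed bilinear softmax: weights[k][i][j] is a class-specific
    matrix W_k, and the logit for class k is a^T W_k v."""
    logits = []
    for w_k in weights:
        logit = sum(
            a * w_k[i][j] * v
            for i, a in enumerate(audio_hidden)
            for j, v in enumerate(video_hidden)
        )
        logits.append(logit)
    return softmax(logits)


def combine_posteriors(p, q, weight=0.5):
    """Assumed combination rule: log-linear interpolation of two
    posterior vectors, renormalised with a softmax."""
    floor = 1e-12  # guards against log(0)
    scores = [
        weight * math.log(max(a, floor))
        + (1.0 - weight) * math.log(max(b, floor))
        for a, b in zip(p, q)
    ]
    return softmax(scores)


if __name__ == "__main__":
    audio, video = [1.0, 0.0], [0.0, 1.0]
    # Two classes; only class 0 rewards audio unit 0 firing with video unit 1.
    w = [[[0.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    print(fuse_hidden(audio, video))
    print(bilinear_softmax(audio, video, w))
    print(combine_posteriors([0.5, 0.5], [0.9, 0.1]))
```

The class-specific matrices are what let the layer score co-occurring audio and visual activations per class, which a softmax over concatenated features can only capture through earlier nonlinear layers.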

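This paper presents deep multimodal learning methods for fusing speech and visual features for audio-visual automatic speech recognition. Experiments show that a deep fusion model, combined with networks that use a bilinear softmax layer, further reduces the phone error rate.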