Large language models have proven themselves highly flexible, able to solve a
wide range of generative tasks, such as abstractive summarization and
open-ended question answering. In this paper we extend the capabilities of LLMs
by directly attaching a small audio encoder, allowing them to perform speech
recognition. By directly prepending a sequence of audio embeddings to the text
token embeddings, the LLM can be converted into an automatic speech recognition
(ASR) system and used in exactly the same manner as its textual counterpart.
Experiments on Multilingual LibriSpeech (MLS) show that incorporating a
conformer encoder into the open-source LLaMA-7B allows it to outperform
monolingual baselines by 18% and perform multilingual speech recognition,
despite LLaMA being trained overwhelmingly on English text. Furthermore, we
perform ablation studies that investigate freezing the LLM completely during
training to maintain its original capabilities, scaling up the audio encoder,
and increasing the audio encoder stride to generate fewer embeddings. The
results of these studies show that multilingual ASR is possible even when the
LLM is frozen or when strides of almost 1 second are used in the audio encoder,
opening up the possibility for LLMs to operate on long-form audio.
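
The prepending step described above can be sketched in a few lines. This is a
minimal illustration, not the paper's implementation: embeddings are plain
Python lists, the function names are hypothetical, and the 960 ms stride is
only an assumed stand-in for "almost 1 second".

```python
from typing import List

Vector = List[float]


def prepend_audio_embeddings(audio: List[Vector], text: List[Vector]) -> List[Vector]:
    """Place audio embeddings in front of text token embeddings.

    Both sequences must share one embedding width so the language model
    can consume them as a single input sequence.
    """
    widths = {len(vec) for vec in audio + text}
    if len(widths) > 1:
        raise ValueError("audio and text embeddings must have the same width")
    return audio + text


def num_audio_embeddings(duration_ms: int, stride_ms: int) -> int:
    """Number of encoder outputs for a clip, one per stride (rounded up)."""
    if stride_ms <= 0:
        raise ValueError("stride_ms must be positive")
    return -(-duration_ms // stride_ms)  # integer ceiling division
```

Under these assumptions, a 30-second clip at a 960 ms stride yields 32 audio
embeddings, which is why a larger stride keeps the prepended sequence short
enough to make long-form audio plausible.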

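By directly attaching a small audio encoder, the capabilities of large language models are extended so that the resulting automatic speech recognition system can be used in the same way as its text-only counterpart. Experiments on Multilingual LibriSpeech show that multilingual ASR remains feasible even when the LLM is frozen or when the audio encoder uses strides of almost 1 second to generate fewer embeddings, opening up the possibility for LLMs to operate on long-form audio.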

Prompting Large Language Models with Speech Recognition Abilities

This report presents the technical details of the OxfordVGG team's submission
to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023. We
present WhisperX, a system for efficient speech transcription of long-form
audio with word-level time alignment, along with two publicly available text
normalisers. Our final submission achieved a Word Error Rate (WER) of 56.0% on
the challenge test set, ranking 1st on the leaderboard. All baseline code and
models are available at this https URL
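
For readers unfamiliar with the metric, WER is conventionally the word-level
edit distance between reference and hypothesis divided by the number of
reference words. The sketch below illustrates that definition only; it is not
the challenge's scoring code, and the lowercasing and whitespace split are
placeholder assumptions rather than the text normalisers mentioned above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100%, and different text
normalisation choices can change the score for the same transcript.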

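This report describes the technical details of our (the OxfordVGG team's) participation in the EGO4D Audio-Visual Automatic Speech Recognition Challenge 2023. We present WhisperX, a system for efficient transcription of long-form audio with word-level time alignment, together with two publicly available text normalisers. Our final submission achieved a word error rate (WER) of 56.0% on the challenge test set, ranking first on the leaderboard. The report also links to all baseline code and models.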

OxfordVGG Submission to the EGO4D AV Transcription Challenge

Improving the performance of end-to-end ASR models on long utterances ranging
from minutes to hours in length is an ongoing challenge in speech recognition.
A common solution is to segment the audio in advance using a separate voice
activity detector (VAD) that decides segment boundary locations based purely on
acoustic speech/non-speech information. VAD segmenters, however, may be
sub-optimal for real-world speech where, e.g., a complete sentence that should
be taken as a whole may contain hesitations in the middle ("set an alarm for...
5 o'clock").
We propose to replace the VAD with an end-to-end ASR model capable of
predicting segment boundaries in a streaming fashion, allowing the segmentation
decision to be conditioned not only on better acoustic features but also on
semantic features from the decoded text, with negligible extra computation. In
experiments on real-world long-form audio (YouTube) with lengths of up to 30
minutes, we demonstrate an 8.5% relative WER improvement and a 250 ms reduction
in median end-of-segment latency compared to the VAD segmenter baseline on a
state-of-the-art Conformer RNN-T model.
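
To make the limitation of purely acoustic segmentation concrete, here is a
minimal sketch of a silence-threshold segmenter. It is not the VAD baseline
used in the paper: the per-frame speech flags, the function name, and the
`max_silence_frames` parameter are all hypothetical.

```python
from typing import List, Optional, Tuple


def vad_segments(is_speech: List[bool], max_silence_frames: int) -> List[Tuple[int, int]]:
    """Split frames into (start, end) ranges, with end exclusive.

    A segment is closed once more than max_silence_frames consecutive
    non-speech frames follow it; the decision uses no textual context.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last_speech = 0
    silence = 0
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i
            last_speech = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence > max_silence_frames:
                segments.append((start, last_speech + 1))
                start = None
                silence = 0
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments
```

With a short threshold, a mid-sentence hesitation such as the pause in "set an
alarm for... 5 o'clock" is enough to split the utterance in two, while a long
threshold delays every segment boundary. A segmenter that also sees the decoded
text does not have to make that trade-off on silence duration alone.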

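Replacing the conventional voice activity detector (VAD) with an end-to-end automatic speech recognition model allows segmentation decisions on long audio to draw not only on better acoustic features but also on semantic features from the decoded text, yielding better performance. In experiments on real-world audio of up to 30 minutes, we show an 8.5% relative WER improvement and a 250 ms reduction in end-of-segment latency over the VAD baseline on a state-of-the-art Conformer RNN-T model.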