The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data respectively, has increased the need for large-scale public ASR corpora and competitive open-source pipelines. Unlike these models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate which factors, such as the choice of training datasets and modeling components, are necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.

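This paper studies a decoder-only model (DOTA) trained on public English ASR corpora. DOTA outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 of 15 test sets. The codebase and model checkpoints are released under a permissive license.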

Exploring the Limits of Decoder-Only Models Trained on Public Speech Recognition Corpora