In this work, we introduce a simple yet efficient post-processing model for automatic speech recognition (ASR). Our model has a Transformer-based encoder-decoder architecture that "translates" ASR model output into grammatically and semantically correct text. We investigate different strategies for regularizing and optimizing the model and show that extensive data augmentation and initialization with pre-trained weights are required to achieve good performance. On the LibriSpeech benchmark, our method demonstrates a significant improvement in word error rate over the baseline acoustic model with greedy decoding, especially on the much noisier dev-other and test-other portions of the evaluation dataset. Our model also outperforms the baseline with 6-gram language model re-scoring and approaches the performance of re-scoring with a Transformer-XL neural language model.
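For readers unfamiliar with the metric, word error rate can be illustrated with a minimal sketch: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from this work, and the text normalization used for the reported LibriSpeech numbers is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution ("sat" -> "sit") and one
# deleted word ("the") against a six-word reference.
reference = "the cat sat on the mat"
asr_output = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, asr_output):.3f}")
```

A correction model of the kind described above would be scored by computing this quantity on its corrected text and comparing it with the value obtained on the raw ASR output.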

Correction of Automatic Speech Recognition with Transformer Sequence-to-Sequence Model