Knowledge distillation is an effective machine learning technique for transferring knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compare soft and hard target distillation for training large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We find that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large-teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. Soft target distillation also allows our production teacher to adapt to new data domains continuously.

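This paper studies knowledge distillation, a technique for transferring knowledge from a large trained teacher model to a smaller student model. It compares soft and hard target distillation for large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours), and finds that hard targets are more effective when the teacher and student have different architectures (for example, a large teacher and a small streaming student). Soft target distillation, in contrast, works better in self-training scenarios such as iterative large-teacher training. Using Noisy Student Training with soft target distillation, we achieve a new SoTA word error rate on LibriSpeech (8% relative improvement on dev-other), and the same approach allows our production teacher to adapt continuously to new data domains.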
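To make the distinction between the two kinds of targets concrete, here is a minimal sketch in Python with NumPy. It is an illustration under simplifying assumptions, not the training setup described above: it treats distillation as independent per-position classification, whereas an RNN-T model defines its output distribution over an alignment lattice and is trained with the RNN-T loss. The function names `soft_target_loss` and `hard_target_loss`, the array shapes, and the random logits are all hypothetical. The soft variant matches the student's full output distribution to the teacher's via a KL divergence; the hard variant keeps only the teacher's top-scoring label at each position and trains the student on it as a pseudo-label.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def soft_target_loss(teacher_logits, student_logits):
    """Soft targets: KL(teacher || student), averaged over positions.

    Assumption: each row is one independent output position; a real
    RNN-T setup would compute this over the model's output lattice.
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl.mean())


def hard_target_loss(teacher_logits, student_logits):
    """Hard targets: cross-entropy against the teacher's argmax labels.

    Assumption: the per-position argmax stands in for the teacher's
    decoded transcript, which is what serves as the pseudo-label in ASR.
    """
    labels = teacher_logits.argmax(axis=-1)
    s_logp = log_softmax(student_logits)
    picked = np.take_along_axis(s_logp, labels[..., None], axis=-1)
    return float(-picked.mean())


# Hypothetical toy data: 4 output positions, vocabulary of 6 tokens.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 6))
student = rng.normal(size=(4, 6))

print("soft-target loss:", soft_target_loss(teacher, student))
print("hard-target loss:", hard_target_loss(teacher, student))
```

In this sketch the soft loss reaches zero only when the student reproduces the teacher's entire distribution, while the hard loss only requires the student to put its probability mass on the teacher's chosen label. That difference is one plausible way to read the finding above: a student with a different architecture may be unable to match the teacher's full distribution, but it can still learn from the labels.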

Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR