This work studies knowledge distillation (KD) and addresses its constraints
for recurrent neural network transducer (RNN-T) models. In hard distillation, a
teacher model transcribes large amounts of unlabelled speech to train a student
model. Soft distillation is another popular KD method that distills the output
logits of the teacher model. Due to the nature of RNN-T alignments, applying
soft distillation between RNN-T architectures having different posterior
distributions is challenging. In addition, bad teachers having high
word-error-rate (WER) reduce the efficacy of KD. We investigate how to
effectively distill knowledge from variable quality ASR teachers, which has not
been studied before to the best of our knowledge. We show that a sequence-level
KD, full-sum distillation, outperforms other distillation methods for RNN-T
models, especially for bad teachers. We also propose a variant of full-sum
distillation that distills the sequence discriminative knowledge of the teacher
leading to further improvement in WER. We conduct experiments on public
datasets namely SpeechStew and LibriSpeech, and on in-house production data.

研究使用知识蒸馏来训练循环神经网络转录器模型的限制，并探讨如何有效地从不同质量的 ASR 教师中蒸馏知识。我们发现，全加和蒸馏方法在 RNN-T 模型中表现最佳，特别是在针对质量差的教师时，另外我们还提出了一种变体的全加和蒸馏方法，提高了 WRE。