Transformers have recently been applied successfully to speech separation, owing to the strong long-range dependency modeling of the self-attention mechanism. However, a Transformer with deep encoder layers tends to have heavy run-time costs, which hinders its deployment on edge devices. A small Transformer model with fewer encoder layers is preferable for computational efficiency, but it is prone to performance degradation. In this paper, we propose an ultra fast speech separation Transformer model that achieves both better performance and better efficiency through teacher-student learning (T-S learning). We introduce layer-wise T-S learning and objective shifting mechanisms to guide the small student model to learn intermediate representations from the large teacher model. Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation on the LibriCSS dataset. By utilizing more unlabeled speech data, our ultra fast speech separation models achieve more than 10% relative WER reduction.
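
The abstract names layer-wise T-S learning and objective shifting without giving formulas, so the following is only a minimal sketch of how the two ideas could fit together, not the paper's implementation. The layer pairing (each student layer matched to an evenly spaced teacher layer), the mean squared error distance, and the linear shifting schedule are all assumptions, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_ts_loss(student_layers, teacher_layers):
    """Average MSE between each student layer and a paired teacher layer.

    Assumption: student layer i (0-based) is paired with teacher layer
    (i + 1) * stride - 1, where stride is the integer ratio of teacher
    depth to student depth. The paper may use a different mapping.
    """
    stride = len(teacher_layers) // len(student_layers)
    losses = [
        mse(s, teacher_layers[(i + 1) * stride - 1])
        for i, s in enumerate(student_layers)
    ]
    return sum(losses) / len(losses)


def shifted_objective(ts_loss, task_loss, step, total_steps):
    """Move the training weight from the T-S loss to the task loss.

    Assumption: a linear schedule, clamped to [0, 1]. Early steps mostly
    imitate the teacher; late steps mostly optimize the separation loss.
    """
    alpha = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - alpha) * ts_loss + alpha * task_loss
```

In this sketch each layer output is a flat list of floats. A real model would compare tensors of hidden states and would compute `task_loss` from the separation objective itself.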

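This paper proposes an ultra fast speech separation Transformer model that uses teacher-student learning with layer-wise teaching and an objective shifting mechanism. Compared with a small Transformer model trained from scratch, it reduces the word error rate (WER) of speech separation on the LibriCSS dataset by more than 5%, and it achieves more than 10% relative WER reduction by using more unlabeled speech data.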

Ultra Fast Speech Separation Model with Teacher Student Learning