Multilingual self-supervised speech representation models have greatly improved speech recognition for low-resource languages, and compressing these large models has become a crucial prerequisite for their industrial application. In this paper, we propose DistilXLSR, a distilled cross-lingual speech representation model. By randomly shuffling the phonemes of existing speech, we reduce its linguistic information, which lets us distill cross-lingual models using only English data. We also design a layer-jumping initialization method to fully leverage the teacher's pre-trained weights. Experiments on 2 kinds of teacher models and 15 low-resource languages show that our method can reduce the parameter count by 50% while maintaining cross-lingual representation ability. The results indicate that our method generalizes across languages and teacher models and has the potential to improve the cross-lingual performance of English pre-trained models.
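The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how phoneme boundaries are obtained or which teacher layers are copied, so the function names `shuffle_phoneme_segments` and `layer_jump_indices`, the `boundaries` argument (assumed to come from some external aligner), and the evenly spaced layer mapping are all assumptions made for illustration.

```python
import random


def shuffle_phoneme_segments(samples, boundaries, seed=None):
    """Cut `samples` at `boundaries` and return the segments in random order.

    `boundaries` is a hypothetical sorted list of sample indices, strictly
    between 0 and len(samples), marking where one phoneme ends and the
    next begins.
    """
    rng = random.Random(seed)
    edges = [0, *boundaries, len(samples)]
    segments = [samples[a:b] for a, b in zip(edges, edges[1:])]
    rng.shuffle(segments)  # reorder whole segments, keep each segment intact
    return [x for segment in segments for x in segment]


def layer_jump_indices(num_teacher_layers, num_student_layers):
    """Pick evenly spaced teacher layers to initialize the student from.

    Assumed mapping (not confirmed by the source): the student's i-th layer
    copies the last teacher layer of the i-th block of `step` layers.
    """
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    step = num_teacher_layers // num_student_layers
    return [step * (i + 1) - 1 for i in range(num_student_layers)]
```

Under these assumptions, a 12-layer student distilled from a 24-layer teacher would take its initial weights from every second teacher layer, and a shuffled utterance keeps exactly the same audio samples as the original with only the order of the phoneme segments changed.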

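This paper introduces DistilXLSR, a distilled cross-lingual speech representation model. By randomly shuffling the phonemes of existing speech to reduce its linguistic information, the authors compress cross-lingual models using only English data, and they design a layer-jumping initialization method to reuse the teacher's pre-trained weights. In experiments on 15 low-resource languages and 2 kinds of teacher models, the method cuts the parameter count by 50% while maintaining cross-lingual representation ability. The results suggest that it generalizes across languages and teacher models and has the potential to improve the cross-lingual performance of English pre-trained models.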

DistilXLSR: A Lightweight Cross-Lingual Speech Representation Model