Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, domain mismatch remains a challenge: the domain of the pre-training data may differ from that of the labeled downstream data used for fine-tuning. In multilingual tasks such as SLID, the pre-trained speech model may also not support all the languages in the downstream task. To address this challenge, we propose self-supervised adaptive pre-training (SAPT), which adapts the pre-trained model to the target domain and languages of the downstream task. We apply SAPT to the XLSR-128 model and investigate its effectiveness for SLID. First, we demonstrate that SAPT improves XLSR performance on the FLEURS benchmark, with substantial gains of up to 40.1% for under-represented languages. Second, we apply SAPT to four different datasets in a few-shot learning setting and show that it improves the sample efficiency of XLSR during fine-tuning. Our experiments provide strong empirical evidence that continual adaptation via self-supervision improves downstream performance for multilingual speech models.
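
To make the few-shot setting concrete, the following is a minimal sketch of how a k-examples-per-language fine-tuning subset might be drawn from a labeled SLID dataset. The abstract does not describe the sampling procedure, so the function name `few_shot_subset`, the `(utterance_id, language_label)` input format, the fixed seed, and the toy language codes are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def few_shot_subset(examples, k, seed=0):
    """Keep at most k labeled examples per language.

    `examples` is an iterable of (utterance_id, language_label) pairs.
    This input format is hypothetical and not taken from the paper.
    """
    # Group utterance ids by their language label.
    by_language = defaultdict(list)
    for utt_id, lang in examples:
        by_language[lang].append(utt_id)

    # Use a local, seeded generator so the subset is reproducible.
    rng = random.Random(seed)
    subset = []
    for lang in sorted(by_language):
        utts = by_language[lang]
        # Languages with fewer than k utterances contribute all of them.
        chosen = rng.sample(utts, min(k, len(utts)))
        subset.extend((utt_id, lang) for utt_id in chosen)
    return subset


# Toy labeled data: three languages with 3, 2, and 1 utterances.
toy_labels = ["am", "am", "am", "sw", "sw", "yo"]
toy = [(f"utt{i}", lang) for i, lang in enumerate(toy_labels)]
subset = few_shot_subset(toy, k=2)
```

With `k=2`, the toy subset holds two utterances each for `am` and `sw` and the single available utterance for `yo`. Sample efficiency could then be compared by fine-tuning the original and the adapted model on the same subset for several values of `k`.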

Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification