Using Self-Supervised Learning (SSL) models as initialization is now a common way to obtain strong results in Speech Translation (ST). However, SSL models also impose a large memory footprint, hindering on-device deployment. In this paper, we leverage the SSL models by pretraining smaller models on their Discrete Speech Units (DSU). We pretrain encoder-decoder models on 1) Filterbank-to-DSU and 2) DSU-to-Translation data, then take the encoder from 1) and the decoder from 2) to initialise a new model, which we finetune on limited speech-translation data. The final model is compact because the DSU pretraining distils the knowledge of the SSL model into it. Our method has several benefits over using DSU as model inputs, such as a shorter inference pipeline and robustness to (DSU) tokenization. In contrast to ASR pretraining, it does not require transcripts, making it applicable to low-resource settings. Evaluation on CoVoST-2 X-En shows that our method is more than $0.5$ BLEU better than an ST model that directly finetunes the SSL model, while having only half the model size, and is on a par with ASR pretraining.
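
The initialisation step (encoder from the Filterbank-to-DSU model, decoder from the DSU-to-Translation model) can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes checkpoints are flat dictionaries keyed by parameter name with `encoder.` and `decoder.` prefixes, and the names `init_from_pretrained`, `fbank2dsu`, and `dsu2trl` are hypothetical. Strings stand in for weight tensors.

```python
# Hypothetical sketch: checkpoints are modelled as flat dicts that map
# parameter names to values, using "encoder." / "decoder." prefixes.
# This naming scheme is an assumption, not taken from the paper.

def init_from_pretrained(fbank2dsu_ckpt, dsu2trl_ckpt):
    """Combine the encoder of model 1) with the decoder of model 2)."""
    init = {}
    # Keep only the encoder parameters of the Filterbank-to-DSU model.
    for name, value in fbank2dsu_ckpt.items():
        if name.startswith("encoder."):
            init[name] = value
    # Keep only the decoder parameters of the DSU-to-Translation model.
    for name, value in dsu2trl_ckpt.items():
        if name.startswith("decoder."):
            init[name] = value
    return init


# Toy checkpoints with string placeholders instead of weight tensors.
fbank2dsu = {"encoder.layer0.weight": "enc_A", "decoder.layer0.weight": "dec_A"}
dsu2trl = {"encoder.layer0.weight": "enc_B", "decoder.layer0.weight": "dec_B"}

init_weights = init_from_pretrained(fbank2dsu, dsu2trl)
print(init_weights)
```

The resulting dictionary would then seed the new model before finetuning on speech-translation data. The DSU decoder of model 1) and the DSU encoder of model 2) are discarded, so DSU are not needed as inputs at inference time.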

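Using self-supervised learning models as initialization has become a common way to obtain good results in speech translation, but it also incurs a large memory overhead for on-device deployment. This paper pretrains smaller models on the discrete speech units of the self-supervised learning model, uses them to initialise a new model that is finetuned on limited speech-translation data, and relies on the discrete speech unit pretraining to distil the knowledge of the self-supervised learning model, which makes the final model more compact. Compared with using discrete speech units as model inputs, our method has several advantages, including a shorter inference pipeline and robustness to (discrete speech unit) tokenization. Compared with automatic speech recognition pretraining, it does not require transcripts, so it is applicable to low-resource settings. Evaluation on the CoVoST-2 X-En dataset shows that our method achieves a BLEU score more than 0.5 higher than a speech translation model that directly finetunes the self-supervised learning model, with only half the model size, and is comparable to automatic speech recognition pretraining.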

Compact Speech Translation Models via Discrete Speech Units Pretraining