How to boost speech pre-training with textual data remains an unsolved problem, because speech and text are very different modalities with distinct characteristics. In this paper, we propose Token2Vec, a novel cross-modal joint pre-training framework for unpaired speech and text built on discrete speech representations. It is pre-trained with a modality-agnostic Transformer encoder and a token-level masked language modeling (tMLM) objective, and it also transfers well to non-ASR tasks. Compared with various speech-only pre-training baselines, Token2Vec improves performance significantly, achieving up to a 17.7% relative WER reduction.
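The tMLM objective operates on discrete token IDs, so masking can be applied uniformly to speech and text sequences. The sketch below illustrates BERT-style token-level masking under stated assumptions: `mask_tokens`, the `mask_id` value, and the `-100` ignore-index convention are illustrative and not taken from the paper.

```python
import random

def mask_tokens(tokens, mask_id, mask_prob=0.15, seed=0):
    """Token-level masking (BERT-style MLM sketch): randomly replace a
    fraction of discrete token IDs with a mask ID; return the corrupted
    sequence and per-position targets (-100 = ignored by the loss)."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_id)
            targets.append(tok)      # model must recover the original token
        else:
            corrupted.append(tok)
            targets.append(-100)     # unmasked positions carry no loss
    return corrupted, targets

# Hypothetical discrete speech (or text) token IDs.
seq = [12, 7, 99, 3, 42, 7, 88, 15]
corrupted, targets = mask_tokens(seq, mask_id=0, mask_prob=0.3)
```

Because both modalities are reduced to token sequences, the same corruption and prediction loop can feed a single modality-agnostic encoder.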