Building a good speech recognition system usually requires large amounts of transcribed data, which is expensive to collect. To tackle this problem, many unsupervised pre-training methods have been proposed. Among these methods, Masked Predictive Coding (MPC) achieved significant improvements on various speech recognition datasets using a BERT-like masked reconstruction loss and a Transformer backbone. However, many aspects of MPC have not been fully investigated. In this paper, we conduct a further study of MPC and focus on three important aspects: the effect of the speaking style of the pre-training data, the extension of MPC to streaming models, and how to better transfer knowledge learned in the pre-training stage to downstream tasks. Experiments revealed that pre-training data with a matching speaking style is more useful for downstream recognition tasks. A unified training objective combining APC (Autoregressive Predictive Coding) and MPC provided an 8.46% relative error reduction on a streaming model trained on HKUST. In addition, the combination of target data adaptation and layer-wise discriminative training helped the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over a strong baseline.
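The gains above are reported as relative error reductions. As a minimal sketch of how such a figure is usually computed, assuming the common definition (baseline error minus new error, divided by baseline error), the helper below works through the arithmetic. The function name `relative_error_reduction` and the example error rates are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction, in percent, of new_error versus baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    # Fraction of the baseline error that was removed, expressed as a percentage.
    return (baseline_error - new_error) / baseline_error * 100.0


if __name__ == "__main__":
    # Hypothetical error rates for illustration only (not taken from the paper).
    baseline, improved = 10.0, 9.0
    print(f"{relative_error_reduction(baseline, improved):.2f}% relative reduction")
```

Because the reduction is relative, the same absolute improvement counts for more when the baseline error is already low.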

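This paper further studies three important aspects of Masked Predictive Coding: the speaking style of the pre-training data, its extension to streaming models, and how to better transfer the knowledge learned during pre-training. Experiments show that pre-training data whose speaking style matches the downstream recognition task is more useful. A unified training objective combining APC and MPC yields an 8.46% relative error reduction for a streaming model trained on HKUST, and the combination of target data adaptation and layer-wise discriminative training aids the knowledge transfer of MPC, achieving a 3.99% relative error reduction on AISHELL over the baseline.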

A Further Study of Unsupervised Pre-training for Transformer-Based Speech Recognition