Masked prediction pre-training has recently driven remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, which makes the codebook less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches that improve both automatic speech recognition (ASR) performance and pre-training efficiency: decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or clustering the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech as used in fine-tuning. Experiments demonstrate that our methods significantly outperform various SSL and self-training baselines, with up to 17.0% relative word error rate (WER) reduction. Our pre-trained models also show good transferability in a non-ASR speech task.
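
To make the CTC clustering idea concrete, the following is a minimal sketch of how frame-level features could be turned into a codebook and discrete masked-prediction targets. It rests on several assumptions that the abstract does not confirm: k-means is used as the clustering algorithm, random vectors stand in for the features of a trained CTC model, and the names `assign_codes`, `kmeans_codebook`, and `num_codes` are hypothetical.

```python
import numpy as np


def assign_codes(features, codebook):
    """Map each frame to the index of its nearest codebook entry."""
    # Squared Euclidean distances, shape (num_frames, num_codes).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans_codebook(features, num_codes, num_iters=20, seed=0):
    """Plain k-means (an assumed choice of clustering algorithm).

    Returns the codebook and the per-frame code indices.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    init_idx = rng.choice(len(features), size=num_codes, replace=False)
    codebook = features[init_idx].copy()
    for _ in range(num_iters):
        labels = assign_codes(features, codebook)
        for k in range(num_codes):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                codebook[k] = members.mean(axis=0)
    return codebook, assign_codes(features, codebook)


# Stand-in data: in practice these would be frame-level features
# extracted from a CTC model trained on the labeled speech.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16)).astype(np.float32)

codebook, targets = kmeans_codebook(features, num_codes=8)
```

Here `targets` holds one code index per frame; under the assumptions above, these indices would play the role of the labels that the model predicts at masked positions during pre-training.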

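This paper proposes two supervision-guided codebook generation methods to improve automatic speech recognition performance and pre-training efficiency: decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or clustering the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Experimental results show that our methods are significantly superior to various SSL and self-training baselines, with up to 17.0% relative WER reduction. Our pre-trained models also show good transferability in a non-ASR speech task.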

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training