Automatic Speech Recognition (ASR) still faces challenges when recognizing
time-variant rare phrases. Contextual biasing (CB) modules bias the ASR model
towards such contextually relevant phrases. During training, a list of biasing
phrases is selected from a large pool of phrases following a sampling
strategy. In this work, we first analyse different sampling strategies to
provide insights into the training of CB for ASR, using correlation plots
between the bias embeddings at various training stages. Second, we introduce
neighbourhood attention (NA), which localizes self-attention (SA) to the nearest
neighbouring frames to further refine the CB output. The results show that the
proposed approach provides, on average, a 25.84% relative WER improvement on
LibriSpeech sets and rare-word evaluation compared to the baseline.
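
The localization idea behind NA can be illustrated with a minimal sketch. It assumes a single head, no learned query/key/value projections, and a symmetric window that is simply truncated at the sequence edges. The function name `neighbourhood_attention` and the `window` parameter are hypothetical and are not taken from the paper.

```python
import math


def neighbourhood_attention(frames, window=1):
    """Self-attention in which frame t attends only to frames t-window..t+window.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Illustrative assumption: queries, keys and values are the raw frames.
    """
    num_frames = len(frames)
    outputs = []
    for t in range(num_frames):
        # Restrict the attention span to the nearest neighbouring frames.
        lo = max(0, t - window)
        hi = min(num_frames, t + window + 1)
        neighbours = frames[lo:hi]
        dim = len(frames[t])

        # Scaled dot-product scores between frame t and each neighbour.
        scores = [
            sum(q * k for q, k in zip(frames[t], n)) / math.sqrt(dim)
            for n in neighbours
        ]

        # Numerically stable softmax over the local window only.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]

        # Weighted sum of the neighbouring frames.
        outputs.append(
            [sum(w * n[d] for w, n in zip(weights, neighbours)) for d in range(dim)]
        )
    return outputs
```

With `window=0` every frame attends only to itself, so the output equals the input. With a window at least as long as the sequence, the sketch reduces to ordinary full self-attention over the raw frames.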

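This work first examines the training of contextual biasing modules by analysing different sampling strategies and correlation plots. It then introduces neighbourhood attention to further refine the contextual biasing output. Experiments show an average relative WER improvement of 25.84% over the baseline on the LibriSpeech sets and rare-word evaluation.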

Locality enhanced dynamic biasing and sampling strategies for contextual ASR

Neural speaker embeddings encode the speaker's speech characteristics through
a DNN model and are prevalent for speaker verification tasks. However, few
studies have investigated the usage of neural speaker embeddings for an ASR
system. In this work, we present our efforts on integrating neural speaker
embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved
embedding extraction pipeline, in combination with the Weighted-Simple-Add
integration method, results in x-vectors and c-vectors reaching performance on
par with i-vectors. We further compare and analyze different speaker embeddings.
We present our acoustic model improvements obtained by switching from the newbob
learning rate schedule to the one-cycle learning rate schedule, which results in
a ~3% relative WER reduction on Switchboard and additionally reduces the overall
training time by 17%. By further adding neural speaker embeddings, we gain an
additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based
hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and
Hub5'01 when trained on SWB 300h.
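
For readers unfamiliar with one-cycle scheduling, the following is a minimal sketch of its general shape: a linear ramp from an initial to a peak learning rate, a symmetric linear ramp back down, and a short final decay. The function name, the step counts, and all learning-rate values are illustrative assumptions and are not the settings used in this work.

```python
def one_cycle_lr(step, cycle_steps, decay_steps,
                 initial_lr=1e-4, peak_lr=1e-3, final_lr=1e-6):
    """Return the learning rate at `step` for a simple one-cycle schedule.

    All default values are hypothetical placeholders, not tuned settings.
    """
    half = cycle_steps // 2
    if step < half:
        # First half of the cycle: linear increase to the peak.
        return initial_lr + (peak_lr - initial_lr) * step / half
    if step < 2 * half:
        # Second half of the cycle: linear decrease back to the initial rate.
        return peak_lr - (peak_lr - initial_lr) * (step - half) / half
    # After the cycle: linear decay from the initial rate to the final rate.
    progress = min(1.0, (step - 2 * half) / max(1, decay_steps))
    return initial_lr - (initial_lr - final_lr) * progress
```

For example, `one_cycle_lr(45, 90, 10)` falls at the peak of a 90-step cycle, and any step at or beyond 100 returns the final rate.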

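This work studies the use of neural speaker embeddings in ASR by integrating them into a Conformer-based hybrid HMM ASR system. It presents an improved embedding extraction pipeline combined with the Weighted-Simple-Add integration method, compares and analyzes different speaker embeddings, and reports acoustic model improvements. The best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 when trained on SWB 300h.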

Improving And Analyzing Neural Speaker Embeddings for ASR

While self-supervised learning has helped reap the benefits of scale from
available unlabeled data, its learning paradigms are continuously being
improved. We present a new pre-training strategy named ccc-wav2vec 2.0, which
uses clustering and an augmentation-based cross-contrastive loss as its
self-supervised objective. Through the clustering module, we scale down the
influence of those negative examples that are highly similar to the positive.
The cross-contrastive loss is computed between the encoder output of the
original sample and the quantizer output of its augmentation and vice-versa,
bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up
to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on
the test-clean and test-other sets, respectively, of LibriSpeech, without the
use of any language model. The proposed method also achieves up to 14.9%
relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on
Switchboard data. We make all our code publicly available on GitHub.
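
The two ideas in the objective can be sketched for a single time step as follows. This is a minimal illustration, not the released implementation: it assumes a wav2vec 2.0 style contrastive loss with cosine similarity, models the clustering module as a per-negative weight below 1 for negatives that resemble the positive, and sums the two cross terms with equal weight. All function and parameter names are hypothetical, and input vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, positive, negatives,
                     neg_weights=None, temperature=0.1):
    """Contrastive loss for one time step.

    `neg_weights` (assumption) scales down negatives that a clustering
    step has judged to be highly similar to the positive.
    """
    if neg_weights is None:
        neg_weights = [1.0] * len(negatives)
    pos_term = math.exp(cosine(context, positive) / temperature)
    neg_term = sum(
        w * math.exp(cosine(context, n) / temperature)
        for w, n in zip(neg_weights, negatives)
    )
    return -math.log(pos_term / (pos_term + neg_term))


def cross_contrastive_loss(ctx_orig, q_orig, ctx_aug, q_aug,
                           negs_orig, negs_aug):
    """Encoder output of one view against quantizer output of the other."""
    # Original encoder output vs. quantized targets of the augmentation.
    orig_to_aug = contrastive_loss(ctx_orig, q_aug, negs_aug)
    # Augmented encoder output vs. quantized targets of the original.
    aug_to_orig = contrastive_loss(ctx_aug, q_orig, negs_orig)
    return orig_to_aug + aug_to_orig
```

Lowering the weight of a negative shrinks its contribution to the denominator, so a negative that is nearly identical to the positive penalizes the model less than it would in the unweighted loss.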

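This work proposes ccc-wav2vec 2.0, a new self-supervised pre-training strategy that uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. It achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the LibriSpeech test-clean and test-other sets, respectively, and up to 14.9% relative WER improvement when fine-tuned on Switchboard data.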