Multi-talker overlapped speech recognition remains a significant challenge,
requiring not only speech recognition but also speaker diarization tasks to be
addressed. In this paper, to better address these tasks, we first introduce
speaker labels into an autoregressive transformer-based speech recognition
model to support multi-speaker overlapped speech recognition. Then, to improve
speaker diarization, we propose a novel speaker mask branch to detect the
speech segments of individual speakers. With the proposed model, we can perform
both speech recognition and speaker diarization tasks simultaneously using a
single model. Experimental results on the LibriSpeech-based overlapped dataset
demonstrate the effectiveness of the proposed method in both speech recognition
and speaker diarization tasks, particularly enhancing the accuracy of speaker
diarization in relatively complex multi-talker scenarios.

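This paper proposes a new model that combines the speech recognition and speaker diarization tasks. By introducing speaker labels and a speaker mask branch, it both recognizes multi-talker overlapped speech and diarizes the speakers. Experiments show that the method effectively improves speaker diarization accuracy in complex multi-talker scenarios.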

Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition

Automatic recognition of overlapped speech remains a highly challenging task
to date. Motivated by the bimodal nature of human speech perception, this paper
investigates the use of audio-visual technologies for overlapped speech
recognition. Three issues associated with the construction of audio-visual
speech recognition (AVSR) systems are addressed. First, the basic architecture
designs of AVSR systems, i.e. end-to-end and hybrid, are investigated. Second,
purposefully designed modality fusion gates are used to robustly integrate the
audio and visual features. Third, in contrast to a traditional pipelined
architecture containing explicit speech separation and recognition components,
a streamlined and integrated AVSR system optimized consistently using the
lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The
proposed LF-MMI time-delay neural network (TDNN) system establishes the
state-of-the-art for the LRS2 dataset. Experiments on overlapped speech
simulated from the LRS2 dataset suggest the proposed AVSR system outperformed
the audio-only baseline LF-MMI DNN system by up to 29.98% absolute in word
error rate (WER) reduction, and produced recognition performance comparable to
a more complex pipelined system. Consistent performance improvements of 4.89%
absolute in WER reduction over the baseline AVSR system using feature fusion
are also obtained.
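
The abstract reports its gains as absolute WER reductions, which are easy to confuse with relative ones. The following is a minimal sketch of how WER and the two kinds of reduction are commonly computed. The function names are hypothetical, and the sentences and WER values are made-up toy inputs, not results from the paper.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref_words = reference.split()
    return 100.0 * word_errors(ref_words, hypothesis.split()) / len(ref_words)


def absolute_reduction(baseline_wer, system_wer):
    """Difference in WER percentage points."""
    return baseline_wer - system_wer


def relative_reduction(baseline_wer, system_wer):
    """Reduction as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Toy example: one substitution ("sat" -> "sit") and one deletion ("the").
toy_wer = wer("the cat sat on the mat", "the cat sit on mat")

# Toy WER values (illustrative only): the same gain expressed both ways.
abs_gain = absolute_reduction(40.0, 30.0)  # percentage points
rel_gain = relative_reduction(40.0, 30.0)  # percent of the baseline
```

With these toy values the absolute reduction is 10 points while the relative reduction is 25%, which is why the qualifier attached to a reported figure matters.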

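This study addresses three issues in using audio-visual technologies to recognize overlapped speech: the basic architecture design, purposefully designed modality fusion gates, and an integrated AVSR system trained with a consistent optimization criterion. Experiments on overlapped speech simulated from the LRS2 dataset show that the system performs comparably to a more complex pipelined architecture with explicit speech separation and recognition components, reduces word error rate (WER) by up to 29.98% absolute over the audio-only baseline LF-MMI DNN system, and improves on the baseline AVSR system using feature fusion by a further 4.89% absolute in WER.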

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Unsupervised single-channel overlapped speech recognition is one of the
hardest problems in automatic speech recognition (ASR). Permutation invariant
training (PIT) is a state-of-the-art model-based approach, which applies a
single neural network to solve this single-input, multiple-output modeling
problem. We propose to advance the current state of the art by imposing a
modular structure on the neural network, applying a progressive pretraining
regimen, and improving the objective function with transfer learning and a
discriminative training criterion. The modular structure splits the problem
into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing,
and speech recognition. The pretraining regimen uses these modules to solve
progressively harder tasks. Transfer learning leverages parallel clean speech
to improve the training targets for the network. Our discriminative training
formulation is a modification of standard formulations that also penalizes
competing outputs of the system. Experiments are conducted on the artificially
overlapped Switchboard and hub5e-swb datasets. The proposed framework achieves
over 30% relative improvement in WER over both a strong jointly trained system,
PIT for ASR, and a separately optimized system, PIT for speech separation with
a clean speech ASR model. The improvement comes from better model generalization,
training efficiency and sequence-level linguistic knowledge integration.
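
For readers unfamiliar with PIT, the core idea can be sketched as follows: the loss is evaluated under every assignment of network outputs to reference speakers, and the smallest value is kept. This is a generic illustration rather than the paper's implementation. The names `pit_loss` and `mse` are hypothetical, and mean squared error stands in for whatever per-stream criterion a real system uses.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(outputs, targets, pair_loss=mse):
    """Return (min_loss, best_perm) over all output-to-target assignments.

    best_perm[i] is the index of the target assigned to outputs[i].
    """
    n = len(outputs)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-stream loss under this assignment.
        loss = sum(pair_loss(outputs[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the two output streams come out in swapped speaker order.
outputs = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]]
targets = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
loss, perm = pit_loss(outputs, targets)
```

Enumerating permutations costs factorial time in the number of speakers, which is acceptable for the two- or three-talker mixtures typical of this setting.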

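This paper proposes a neural network model built on a modular structure, progressive pretraining, transfer learning, and a discriminative training criterion. Compared with existing models, it performs better on unsupervised single-channel overlapped speech recognition, achieving over 30% relative improvement in word error rate.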