We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain
architecture for single-channel target-speaker automatic speech recognition
(TS-ASR). The model consists of a TitaNet-based speaker embedding module and
Conformer-based masking and ASR modules. These modules are jointly
optimized to transcribe a target speaker while ignoring speech from other
speakers. For training, we use Connectionist Temporal Classification (CTC) loss
and introduce a scale-invariant spectrogram reconstruction loss to encourage
the model to better separate the target speaker's spectrogram from the mixture. We
obtain state-of-the-art target-speaker word error rate (TS-WER) on
WSJ0-2mix-extr (4.2%). Further, we report for the first time TS-WER on
WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.2%) and LibriSpeech3Mix (7.6%)
datasets, establishing new benchmarks for TS-ASR. The proposed model will be
open-sourced through the NVIDIA NeMo toolkit.
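
To make the training objective more concrete, the following is a minimal sketch of how a scale-invariant reconstruction term could be combined with a CTC loss. The abstract does not give the exact formula, so this assumes an SI-SDR-style definition applied to magnitude spectrograms; the function names `si_sdr_loss` and `total_loss` and the weight of 0.1 are hypothetical and chosen only for illustration.

```python
import math


def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR (in dB) between two spectrograms.

    Both inputs are lists of frames, each frame a list of magnitude bins.
    Assumption: an SI-SDR-style formulation; the source does not specify
    the exact form of its reconstruction loss.
    """
    est = [v for frame in estimate for v in frame]
    ref = [v for frame in target for v in frame]
    if len(est) != len(ref):
        raise ValueError("estimate and target must have the same shape")

    # Project the estimate onto the target to find the best rescaling
    # of the target; this is what makes the measure scale-invariant.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    signal = sum((alpha * r) ** 2 for r in ref)
    noise = sum((e - alpha * r) ** 2 for e, r in zip(est, ref))

    # Higher SI-SDR is better, so the loss is its negative.
    return -10.0 * math.log10((signal + eps) / (noise + eps))


def total_loss(ctc_loss, estimate, target, weight=0.1):
    """Add the weighted reconstruction term to a precomputed CTC loss value.

    The weight is a hypothetical hyperparameter, not taken from the source.
    """
    return ctc_loss + weight * si_sdr_loss(estimate, target)
```

Because the estimate is compared against an optimally rescaled target, multiplying the estimated spectrogram by a constant leaves the loss essentially unchanged, so the masking module is rewarded for separating the target speaker rather than for matching absolute level.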
