This paper presents an end-to-end model designed to improve automatic speech
recognition (ASR) for a particular speaker in a crowded, noisy environment. The
model utilizes a single-channel speech enhancement module that isolates the
speaker's voice from background noise, along with an ASR module. Through this
approach, the model decreases the word error rate (WER) of ASR from
80% to 26.4%. Typically, these two components are tuned independently because
their data requirements differ. However, speech enhancement can introduce
artifacts that degrade ASR accuracy. With a joint fine-tuning strategy, the
model reduces the WER further, from 26.4% with separate tuning to 14.5% with
joint tuning.
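
To make the reported numbers concrete, here is a minimal sketch of how a word error rate is commonly computed (word-level edit distance divided by the reference length) and of the relative reduction between two WER values. The function names and the whitespace tokenization are assumptions for illustration; the abstract does not describe the paper's scoring pipeline.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before
```

Under this definition, going from 80% to 26.4% is a relative reduction of 67%, and going from 26.4% to 14.5% is a further relative reduction of about 45%.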

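The paper proposes an end-to-end model that improves automatic speech recognition (ASR) for a specific speaker in crowded, noisy environments. It combines a single-channel speech enhancement module, which isolates the speaker's voice from background noise, with an ASR module, lowering the word error rate (WER) from 80% to 26.4%. The two components are usually tuned independently because their data requirements differ, but speech enhancement can introduce artifacts that degrade ASR accuracy. Joint fine-tuning lowers the WER from 26.4% with separate tuning to 14.5%.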

Convoifilter: A case study of doing cocktail party speech recognition

In multi-talker conditions, extracting the target speaker with time-domain
single-channel speech enhancement (SE) remains challenging when no prior
information is available. Auditory attention decoding has shown that the brain
activity of the listener contains the auditory information of the attended
speaker. In this paper, we thus propose a novel time-domain brain-assisted SE
network (BASEN) incorporating electroencephalography (EEG) signals recorded
from the listener for extracting the target speaker from monaural speech
mixtures. The proposed BASEN is based on the fully-convolutional time-domain
audio separation network. In order to fully leverage the complementary
information contained in the EEG signals, we further propose a convolutional
multi-layer cross attention module to fuse the dual-branch features.
Experimental results on a public dataset show that the proposed model
outperforms the state-of-the-art method in several evaluation metrics. The
reproducible code is available at this https URL
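
As a rough illustration of the fusion idea, the sketch below shows plain scaled dot-product cross attention, in which each audio-branch feature frame (query) gathers information from EEG-branch feature frames (keys and values). It is only a sketch under stated assumptions: the abstract does not specify the module's internals, and the convolutional projections and multi-layer stacking it mentions are omitted here. All names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # [frame][feature channel]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """For each query frame, return an attention-weighted mix of the value frames.

    Assumption: queries come from the audio branch, keys and values from the
    EEG branch, and queries and keys share the same feature dimension.
    """
    dim = len(queries[0])
    fused: Matrix = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused
```

The output has one fused frame per audio frame, so it can be combined with the audio features before separation; how that combination is done in BASEN is not stated in the abstract.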

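This work proposes BASEN, a time-domain single-channel speech enhancement network that uses electroencephalography (EEG) signals to extract the target speaker's speech from multi-talker mixtures; experiments show that it outperforms existing methods on several evaluation metrics.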

BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions

Multi-frame algorithms for single-channel speech enhancement can take
advantage of short-time correlations within the speech signal. Deep Filtering
(DF) was proposed to directly estimate a complex filter in the frequency domain to
take advantage of these correlations. In this work, we present a real-time
speech enhancement demo using DeepFilterNet. DeepFilterNet's efficiency is
enabled by exploiting domain knowledge of speech production and psychoacoustic
perception. Our model matches state-of-the-art speech enhancement
benchmarks while achieving a real-time factor of 0.19 on a single-threaded
notebook CPU. The framework as well as pretrained weights have been published
under an open source license.
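
The multi-frame filtering idea can be sketched as follows: each enhanced time-frequency bin is a weighted sum, with complex weights, of the same frequency bin in the current and previous noisy frames. This is a minimal sketch under stated assumptions: the filter is causal with no look-ahead, frames before the start of the signal count as zero, and the coefficients are passed in directly, whereas in Deep Filtering they would be estimated by a network. The function name and data layout are hypothetical.

```python
from typing import List

Spectrogram = List[List[complex]]       # [frame][frequency bin]
FilterTaps = List[List[List[complex]]]  # [frame][tap][frequency bin]


def apply_multiframe_filter(spec: Spectrogram, coefs: FilterTaps) -> Spectrogram:
    """Compute out[t][f] = sum over i of coefs[t][i][f] * spec[t - i][f].

    Tap i weights the frame i steps in the past; tap 0 is the current frame.
    """
    num_frames = len(spec)
    num_bins = len(spec[0]) if spec else 0
    out: Spectrogram = [[0j] * num_bins for _ in range(num_frames)]
    for t in range(num_frames):
        for i, tap in enumerate(coefs[t]):
            if t - i < 0:
                break  # frames before the signal start are treated as zero
            past = spec[t - i]
            for f in range(num_bins):
                out[t][f] += tap[f] * past[f]
    return out
```

With a single tap this reduces to an ordinary complex mask per bin; additional taps are what let the filter use correlations across neighboring frames.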

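This paper presents a real-time speech enhancement demo built on DeepFilterNet. By exploiting domain knowledge of speech production and psychoacoustic perception, the model matches state-of-the-art speech enhancement benchmarks while achieving a real-time factor of 0.19 on a single-threaded notebook CPU. The framework and pretrained weights have been published under an open source license.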
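
The real-time factor is conventionally the processing time divided by the duration of the audio processed, so a value below 1 means faster than real time, and 0.19 means roughly 0.19 seconds of computation per second of audio. The sketch below shows that ratio and one way to measure it; the helper names and the `process` callable are hypothetical and are not part of the DeepFilterNet framework.

```python
import time
from typing import Callable, Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    process: Callable[[Sequence[float]], object],
    samples: Sequence[float],
    sample_rate: int,
) -> float:
    """Time one call of `process` on `samples` and return its real-time factor."""
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

For example, `real_time_factor(1.9, 10.0)` corresponds to the reported value of 0.19. A single measurement is noisy, so averaging over several runs gives a steadier estimate.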