Humans possess an extraordinary ability to selectively focus on the sound
source of interest amidst complex acoustic environments, commonly referred to
as cocktail party scenarios. In an attempt to replicate this remarkable
auditory attention capability in machines, target speaker extraction (TSE)
models have been developed. These models leverage the pre-registered cues of
the target speaker to extract the sound source of interest. However, the
effectiveness of these models is hindered in real-world scenarios due to the
potential variation or even absence of pre-registered cues. To address this
limitation, this study investigates the integration of natural language to
enhance the flexibility and controllability of existing TSE models.
Specifically, we propose a model named LLM-TSE, wherein a large language model
(LLM) extracts useful semantic cues from the user's typed text input, which
can complement the pre-registered cues or work independently to control the TSE
process. Our experimental results demonstrate competitive performance when only
text-based cues are presented, and a new state-of-the-art is set when combined
with pre-registered acoustic cues. To the best of our knowledge, this is the
first work that has successfully incorporated text-based cues to guide target
speaker extraction, which can be a cornerstone for cocktail party problem
research.

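By incorporating natural language processing, this study proposes a model named LLM-TSE that extracts useful semantic cues from the user's text input, either to complement pre-registered cues or to control the target speaker extraction process independently. Experimental results show competitive performance when only text cues are used, and a new state of the art when they are combined with pre-registered acoustic cues. To the best of our knowledge, this is the first study to successfully incorporate text cues into the target speaker extraction task, and it can serve as a cornerstone for research on the cocktail party problem.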

Typing to Listen at the Cocktail Party: Text-Guided Target Speaker Extraction

As a practical alternative to speech separation, target speaker extraction
(TSE) aims to extract the speech of a desired speaker using an additional
cue derived from that speaker. Its main challenge lies in how to properly
extract and leverage the speaker cue to improve the quality of the extracted
speech. The cue extraction method adopted in the majority of existing TSE
studies is to directly use a discriminative speaker embedding extracted from
models pre-trained for speaker verification. Although high speaker
discriminability is a highly desirable property for the speaker verification
task, we argue that it may be too sophisticated for TSE. In this study, we propose that
a simplified speaker cue with clear class separability might be preferred for
TSE. To verify our proposal, we introduce several forms of speaker cues,
including naive speaker embeddings (such as x-vector and xi-vector) and new
speaker embeddings produced by a sparse LDA transform. The corresponding TSE
models are built by integrating these speaker cues with SepFormer (a SOTA
speech separation model). The performance of these TSE models is examined on the
benchmark WSJ0-2mix dataset. Experimental results validate the effectiveness
and generalizability of our proposal, showing up to 9.9% relative improvement
in SI-SDRi. Moreover, with an SI-SDRi of 19.4 dB and a PESQ of 3.78, our best TSE
system significantly outperforms the current SOTA systems and offers the best
TSE results reported to date on WSJ0-2mix.
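
For readers unfamiliar with the SI-SDRi figure quoted above, the following is a minimal sketch of how scale-invariant SDR and its improvement over the unprocessed mixture are commonly computed. It follows the standard definition of the metric and is not the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement`, the mean-removal step, and the small `eps` guard are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Whatever remains after removing the target counts as distortion.
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Hypothetical toy signals: a target tone plus an interfering tone.
    n = 100
    target = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    interferer = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
    mixture = [t + v for t, v in zip(target, interferer)]
    # Pretend an extractor attenuated the interferer by a factor of 10.
    estimate = [t + 0.1 * v for t, v in zip(target, interferer)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.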

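This paper proposes a target speaker extraction method based on simplified speaker cues. Integrating x-vector, xi-vector, and new speaker embeddings produced by an LDA transform into the SepFormer model significantly improves performance. Experiments on the WSJ0-2mix dataset show that the method reaches an SI-SDRi of 19.4 dB and a PESQ of 3.78, a significant improvement over current SOTA models and the best TSE results reported on WSJ0-2mix to date.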

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Audio-visual multi-modal modeling has been demonstrated to be effective in
many speech-related tasks, such as speech recognition and speech enhancement.
This paper introduces a new time-domain audio-visual architecture for target
speaker extraction from monaural mixtures. The architecture generalizes the
previous TasNet (time-domain speech separation network) to enable multi-modal
learning, and at the same time it extends classical audio-visual speech
separation from the frequency domain to the time domain. The main components of
the proposed architecture include an audio encoder, a video encoder that extracts
lip embeddings from video streams, a multi-modal separation network, and an audio
decoder. Experiments on simulated mixtures based on the recently released LRS2
dataset show that our method brings Si-SNR improvements of more than 3 dB and
4 dB in the two- and three-speaker cases, respectively, compared to audio-only
TasNet and frequency-domain audio-visual networks.

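This paper introduces a new time-domain audio-visual architecture for extracting a target speaker from monaural mixtures. Experiments show that, compared with audio-only TasNet and frequency-domain audio-visual networks, the method yields Si-SNR improvements of more than 3 dB and 4 dB in the two- and three-speaker cases, respectively.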