Uncertainty modeling in speaker representation aims to learn the variability
present in speech utterances. While conventional cosine scoring is
computationally efficient and prevalent in speaker recognition, it lacks the
capability to handle uncertainty. To address this challenge, this paper
proposes an approach for estimating uncertainty at the speaker embedding
front-end and propagating it to the cosine scoring back-end. Experiments
conducted on the VoxCeleb and SITW datasets confirm the efficacy of the
proposed method in handling uncertainty arising from embedding estimation. It
achieves average reductions of 8.5% and 9.8% in EER and minDCF, respectively,
compared to conventional cosine similarity. It is also computationally
efficient in practice.
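
For reference, the conventional cosine-scoring baseline that the abstract compares against can be sketched as below. This is only the standard cosine similarity between an enrollment and a test embedding; the uncertainty-aware scoring proposed in the paper is not reproduced here, because the abstract does not state its formula. The function name `cosine_score` and the use of plain Python lists are illustrative assumptions.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two speaker embeddings (lists of floats).

    Hypothetical helper for illustration; not the paper's uncertainty-aware score.
    """
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    if norm_enroll == 0.0 or norm_test == 0.0:
        raise ValueError("embeddings must be non-zero")
    # The score depends only on the angle between the vectors, not their length.
    return dot / (norm_enroll * norm_test)
```

Because the score ignores vector length and treats each embedding as a single point, it has no way to express how reliable an embedding is, which is the limitation the abstract refers to.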

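This paper proposes a method for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments confirm its effectiveness in handling uncertainty arising from embedding estimation: compared with conventional cosine similarity, EER and minDCF are reduced by 8.5% and 9.8% on average, and the method is also computationally efficient in practice.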

Cosine Scoring with Uncertainty for Neural Speaker Embedding

Self-supervised learning (SSL) has attracted increased attention for learning
meaningful speech representations. Speech SSL models, such as WavLM, employ
masked prediction training to encode general-purpose representations. In
contrast, speaker SSL models, exemplified by DINO-based models, adopt
utterance-level training objectives primarily for speaker representation.
Understanding how these models represent information is essential for refining
model efficiency and effectiveness. Unlike the various analyses of speech SSL,
there has been limited investigation into what information speaker SSL captures
and how its representation differs from speech SSL or other fully-supervised
speaker models. This paper addresses these fundamental questions. We explore
the capacity to capture various speech properties by applying SUPERB evaluation
probing tasks to speech and speaker SSL models. We also examine which layers
are predominantly utilized for each task to identify differences in how speech
is represented. Furthermore, we conduct direct comparisons to measure the
similarities between layers within and across models. Our analysis reveals that
1) the capacity to represent content information is somewhat unrelated to
enhanced speaker representation, 2) specific layers of speech SSL models appear
to be partly specialized in capturing linguistic information, and 3) speaker SSL
models tend to disregard linguistic information but exhibit more sophisticated
speaker representation.
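
One common way to see which layers a probing task relies on, used in SUPERB-style evaluation, is to combine the frozen layer outputs with a learnable, softmax-normalized weight per layer and then inspect the learned weights. The sketch below shows only that weighted combination under this assumption; the abstract does not specify the exact procedure, and the names `softmax` and `weighted_layer_sum` are hypothetical.

```python
import math


def softmax(raw_weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, raw_weights):
    """Combine per-layer feature vectors with softmax-normalized layer weights.

    layer_features: one feature vector per layer, all of the same dimension.
    raw_weights: one unnormalized scalar per layer (learned in a real setup).
    Returns the combined vector and the normalized weights.
    """
    if len(layer_features) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    norm = softmax(raw_weights)
    dim = len(layer_features[0])
    combined = [
        sum(norm[layer] * layer_features[layer][d] for layer in range(len(norm)))
        for d in range(dim)
    ]
    return combined, norm
```

After training a probe, a large normalized weight for a layer suggests the task draws mainly on that layer's representation.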

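This study explores the capacity of self-supervised learning models to capture speech and speaker representations. It finds that specific layers of speech models focus more on capturing linguistic information, whereas speaker models focus more on refining speaker representation.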

What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

This paper investigates a self-adaptation method for speech enhancement using
auxiliary speaker-aware features; we extract a speaker representation used for
adaptation directly from the test utterance. Conventional studies of deep
neural network (DNN)-based speech enhancement mainly focus on building a
speaker-independent model. Meanwhile, in speech applications including speech
recognition and synthesis, it is known that model adaptation to the target
speaker improves accuracy. Our research question is whether a DNN for
speech enhancement can be adapted to unknown speakers without any auxiliary
guidance signal in the test phase. To achieve this, we adopt multi-task learning
of speech enhancement and speaker identification, and use the output of the final
hidden layer of the speaker identification branch as an auxiliary feature. In
addition, we use multi-head self-attention for capturing long-term dependencies
in the speech and noise. Experimental results on a public dataset show that our
strategy achieves state-of-the-art performance and also outperforms
conventional methods in terms of subjective quality.
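
The multi-head self-attention mentioned above lets every frame attend to every other frame, which is how long-term dependencies can be captured. The sketch below is a deliberately simplified illustration, not the paper's network: it assumes no learned query, key, value, or output projections (each head simply uses its slice of the input as Q, K, and V), and the function names are hypothetical.

```python
import math


def _softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with Q = K = V = frames (no projections)."""
    dim = len(frames[0])
    output = []
    for query in frames:
        # Similarity of this frame to every frame in the utterance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = _softmax(scores)
        # Each output frame is a weighted average over all frames.
        output.append([
            sum(weights[t] * frames[t][i] for t in range(len(frames)))
            for i in range(dim)
        ])
    return output


def multi_head_self_attention(frames, num_heads):
    """Split features into heads, attend per head, then concatenate the heads."""
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = [
        self_attention([f[h * head_dim:(h + 1) * head_dim] for f in frames])
        for h in range(num_heads)
    ]
    return [
        [value for head in heads for value in head[t]]
        for t in range(len(frames))
    ]
```

In a trained model each head would have its own learned projections, so different heads could focus on different time spans of the speech and noise.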

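This paper investigates a self-adaptive speech enhancement method that uses auxiliary speaker-aware features, extracting the speaker representation used for adaptation directly from the test utterance. It adopts multi-task learning of speech enhancement and speaker identification, and uses the output of the final hidden layer of the speaker identification branch as an auxiliary feature. In addition, multi-head self-attention is used to capture long-term dependencies in the speech and noise. Experimental results on a public dataset show that the strategy achieves state-of-the-art performance and outperforms conventional methods in terms of subjective quality.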