Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are predominantly trained on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model pre trained on extensive multilingual data to generate robust speaker embeddings via a joint loss optimization strategy that leverages online hard triplet mining and self supervised Normalized Temperature-scaled Cross Entropy loss. By capitalizing on Whisper language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and Voxconverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA TDNN, and Xvector, in terms of lower equal error rates and higher AUC scores. These results validate our hypothesis that a multilingual pre-trained ASR encoder, combined with joint loss optimization, substantially improves speaker identification performance in non-English languages.

本研究针对多语言环境中的说话人识别问题，提出了一种新颖的WSI框架，利用在丰富多语言数据上预训练的Whisper自动语音识别模型的编码器，通过联合损失优化策略生成稳健的说话人嵌入。研究发现，该方法在多种语言和录音条件下优于现有的最新基准，提高了识别性能，尤其在非英语语言中的表现显著。

低语者识别：利用预训练多语言变换器实现稳健的说话人嵌入