Zero-shot voice conversion (VC) aims to convert the timbre of a source speaker
to that of an arbitrary unseen target speaker while keeping the linguistic content
unchanged. Although the voice of generated speech can be controlled by
providing the speaker embedding of the target speaker, the speaker similarity
still lags behind that of ground-truth recordings. In this paper, we propose
SEF-VC, a speaker-embedding-free voice conversion model that is designed to
learn and incorporate speaker timbre from reference speech via a powerful
position-agnostic cross-attention mechanism and then reconstruct the waveform from
HuBERT semantic tokens in a non-autoregressive manner. The concise design of
SEF-VC enhances its training stability and voice conversion performance.
Objective and subjective evaluations demonstrate that SEF-VC generates
high-quality speech with better similarity to the target reference than
strong zero-shot VC baselines, even for very short reference speech.
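
To illustrate what "position-agnostic" means for the cross-attention described above, here is a minimal sketch in plain Python. It is not the SEF-VC implementation: the function name `cross_attention`, the single head, the toy dimensions, and the absence of learned query/key/value projections are all assumptions made for brevity. Content frames act as queries and reference frames act as both keys and values, with no positional encoding on the reference side, so reordering the reference frames should leave the output unchanged up to floating-point rounding.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, reference):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries:   list of content vectors (one per content frame)
    reference: list of reference-speech vectors, used as keys AND values.
    No positional encoding is added to the reference, which is the
    assumption that makes this sketch position-agnostic.
    """
    d = len(reference[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in reference]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, reference)) for j in range(d)])
    return out


# Toy data: 3 content frames and 5 reference frames, 4 dimensions each.
rng = random.Random(0)
content = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
reference = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

# Shuffle the reference frames; the attention output should not change.
shuffled = reference[:]
rng.shuffle(shuffled)

out_a = cross_attention(content, reference)
out_b = cross_attention(content, shuffled)
max_diff = max(abs(a - b) for ra, rb in zip(out_a, out_b) for a, b in zip(ra, rb))
```

Because each output frame is a weighted sum over an unordered set of reference frames, `max_diff` stays at rounding-error level; a model that added positional encodings to the reference would not have this property.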

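SEF-VC is a speaker-embedding-free voice conversion model that learns and incorporates speaker timbre from reference speech through a powerful position-agnostic cross-attention mechanism and reconstructs the waveform from HuBERT semantic tokens in a non-autoregressive manner, which improves stability and voice conversion performance. Objective and subjective evaluations show that SEF-VC outperforms strong zero-shot VC baselines, generating high-quality speech that is more similar to the target reference, even for very short reference speech.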

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

This paper presents a novel task, zero-shot voice conversion based on face
images (zero-shot FaceVC), which aims at converting the voice characteristics
of an utterance from any source speaker to those of a previously unseen target
speaker, relying solely on a single face image of the target speaker. To address this
task, we propose a face-voice memory-based zero-shot FaceVC method. This method
leverages a memory-based face-voice alignment module, in which slots act as the
bridge to align these two modalities, allowing for the capture of voice
characteristics from face images. A mixed supervision strategy is also
introduced to mitigate the long-standing issue of the inconsistency between
training and inference phases for voice conversion tasks. To obtain
speaker-independent content-related representations, we transfer the knowledge
from a pretrained zero-shot voice conversion model to our zero-shot FaceVC
model. Considering the differences between FaceVC and traditional voice
conversion tasks, systematic subjective and objective metrics are designed to
thoroughly evaluate the homogeneity, diversity and consistency of voice
characteristics controlled by face images. Through extensive experiments, we
demonstrate the superiority of our proposed method on the zero-shot FaceVC
task. Samples are presented on our demo website.
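
As a rough illustration of how memory slots could bridge the face and voice modalities, the sketch below addresses a small memory with a face embedding and reads out a voice representation with the same weights. The abstract does not specify the module's internals, so everything here is an assumption: the function name `recall_voice`, the paired `face_slots` and `voice_slots`, the dot-product addressing, and the toy values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def recall_voice(face_embedding, face_slots, voice_slots):
    """Address the memory with a face embedding, then read out a voice vector.

    face_slots[i] and voice_slots[i] are assumed to form one paired slot:
    the face side is used for addressing and the voice side for recall.
    """
    scores = [sum(f * k for f, k in zip(face_embedding, key)) for key in face_slots]
    weights = softmax(scores)
    dim = len(voice_slots[0])
    recalled = [sum(w * v[j] for w, v in zip(weights, voice_slots)) for j in range(dim)]
    return weights, recalled


# Hypothetical memory with 3 slots: 2-d face keys and 3-d voice values.
face_slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
voice_slots = [[0.2, 0.8, 0.0], [0.9, 0.1, 0.3], [0.5, 0.5, 0.5]]

# A face embedding that lies close to the first face key.
weights, voice = recall_voice([4.0, 0.0], face_slots, voice_slots)
```

In this toy setup the first slot receives most of the addressing weight, so the recalled voice vector lies close to the first voice value. Sharing the addressing weights across modalities is what lets a face image stand in for a voice query in this sketch.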

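For the task of zero-shot voice conversion based on face images, the paper proposes a novel zero-shot face-based voice conversion (FaceVC) method that uses a face-voice alignment module and a mixed supervision strategy to convert voice characteristics from a source speaker to a target speaker, and draws on a pretrained zero-shot voice conversion model. Extensive experiments demonstrate the superiority of the method on the zero-shot FaceVC task.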

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

In recent years, large-scale pre-trained speech language models (SLMs) have
demonstrated remarkable advancements in various generative speech modeling
applications, such as text-to-speech synthesis, voice conversion, and speech
enhancement. These applications typically involve mapping text or speech inputs
to pre-trained SLM representations, from which target speech is decoded. This
paper introduces a new approach, SLMGAN, to leverage SLM representations for
discriminative tasks within the generative adversarial network (GAN) framework,
specifically for voice conversion. Building upon StarGANv2-VC, we add our novel
SLM-based WavLM discriminators on top of the mel-based discriminators along
with our newly designed SLM feature matching loss function, resulting in an
unsupervised zero-shot voice conversion system that does not require text
labels during training. Subjective evaluation results show that SLMGAN
outperforms existing state-of-the-art zero-shot voice conversion models in
terms of naturalness and achieves comparable similarity, highlighting the
potential of SLM-based discriminators for related applications.
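
The abstract does not give the form of the SLM feature matching loss. As a hedged illustration, the sketch below implements the generic feature matching loss commonly used in GAN training: the mean absolute difference between intermediate features computed for real and generated audio, averaged over layers. The function name `feature_matching_loss`, the per-layer averaging, and the toy feature values are assumptions, and the actual SLMGAN loss may differ.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Generic L1 feature matching loss (illustrative, not SLMGAN's exact loss).

    real_feats, fake_feats: lists of per-layer feature vectors, e.g. flattened
    hidden states taken from a speech language model for real and generated
    speech. Each layer contributes its mean absolute difference, and the
    result is the average over layers.
    """
    layer_losses = []
    for real, fake in zip(real_feats, fake_feats):
        diffs = [abs(r - f) for r, f in zip(real, fake)]
        layer_losses.append(sum(diffs) / len(diffs))
    return sum(layer_losses) / len(layer_losses)


# Toy features from two hypothetical layers of different widths.
real = [[0.0, 1.0], [2.0, 2.0, 2.0]]
fake = [[0.5, 1.0], [2.0, 1.0, 2.0]]

loss = feature_matching_loss(real, fake)
identical = feature_matching_loss(real, real)
```

Minimizing such a loss pushes the generator to produce speech whose internal representations match those of real speech, rather than only fooling the discriminator's final decision.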

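The paper introduces SLMGAN, a new approach that uses large-scale pre-trained speech language models (SLMs) for discriminative tasks within the generative adversarial network (GAN) framework, specifically for voice conversion. By adding SLM-based WavLM discriminators on top of the mel-based discriminators, together with a newly designed SLM feature matching loss function, it yields an unsupervised zero-shot voice conversion system that does not require text labels during training. Subjective evaluation results show that SLMGAN outperforms existing zero-shot voice conversion models in naturalness and achieves comparable similarity, highlighting the potential of SLM-based discriminators for related applications.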