Zero-shot voice conversion (VC) aims to convert the timbre of a source speaker to
that of an arbitrary unseen target speaker, while keeping the linguistic content
unchanged. Although the voice of generated speech can be controlled by
providing the speaker embedding of the target speaker, the speaker similarity
still lags behind the ground truth recordings. In this paper, we propose
SEF-VC, a speaker embedding free voice conversion model, which is designed to
learn and incorporate speaker timbre from reference speech via a powerful
position-agnostic cross-attention mechanism, and then reconstruct waveform from
HuBERT semantic tokens in a non-autoregressive manner. The concise design of
SEF-VC enhances its training stability and voice conversion performance.
Objective and subjective evaluations demonstrate that SEF-VC generates
high-quality speech with better similarity to the target reference than
strong zero-shot VC baselines, even for very short reference speech.
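
To make the idea of position-agnostic cross-attention concrete, the following is a minimal sketch and not the SEF-VC implementation. It assumes single-head scaled dot-product attention with no learned projections, where the queries stand in for embedded content tokens and the keys and values stand in for reference-speech frames that carry no positional encoding. The function names, the dimensions, and the random toy data are all hypothetical. Under these assumptions, shuffling the reference frames leaves the output unchanged up to floating-point rounding, which is the property the term "position-agnostic" suggests.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: content frames, shape [T_q][d]
    keys, values: reference-speech frames, shape [T_ref][d]
    No positional encoding is added to keys or values (assumption).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the reference value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


rng = random.Random(0)
# Toy data: 3 content frames and 5 reference frames, each of dimension 4.
content = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
reference = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

out = cross_attention(content, reference, reference)

# Shuffle the reference frames; keys and values are permuted together.
shuffled = reference[:]
rng.shuffle(shuffled)
out_shuffled = cross_attention(content, shuffled, shuffled)

# Largest element-wise difference between the two outputs.
max_diff = max(
    abs(a - b)
    for row, row_s in zip(out, out_shuffled)
    for a, b in zip(row, row_s)
)
print(max_diff < 1e-9)
```

Because the attention weights depend only on query-key dot products, the order of the reference frames does not affect the result in this sketch. A real model would add learned projections, multiple heads, and a vocoder stage, none of which are shown here.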

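SEF-VC is a speaker embedding free voice conversion model that learns and incorporates speaker timbre from reference speech through a powerful position-agnostic cross-attention mechanism and reconstructs the waveform from HuBERT semantic tokens in a non-autoregressive manner, which improves stability and voice conversion performance. Objective and subjective evaluations show that SEF-VC outperforms strong zero-shot VC baselines, generating high-quality speech with better similarity to the target reference, even for very short reference speech.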

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

We formulated non-speech vocalization (NSV) modeling as a text-to-speech task
and verified its viability. Specifically, we evaluated the phonetic
expressivity of HuBERT speech units on NSVs and verified our model's ability to
control speaker timbre even though the training data is few-shot with respect to speakers.
In addition, we substantiated that the heterogeneity in recording conditions is
the major obstacle to NSV modeling. Finally, we discussed five possible improvements
to our method for future research. Audio samples of synthesized NSVs are
available on our demo page: this https URL.

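This paper studies the feasibility of modeling non-speech vocalizations (NSVs) as a text-to-speech task, evaluates the phonetic expressivity of HuBERT speech units on NSVs and the model's ability to control speaker timbre, examines the obstacles to NSV modeling, and proposes five improvements for future research.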