Voice conversion (VC) aims at altering a person's voice to make it sound
similar to the voice of another person while preserving linguistic content.
Existing methods suffer from a dilemma between content intelligibility and
speaker similarity; i.e., methods with higher intelligibility usually have a
lower speaker similarity, while methods with higher speaker similarity usually
require a large amount of target speaker voice data to achieve high intelligibility. In
this work, we propose a novel method, \textit{Phoneme Hallucinator}, that
achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model;
it adopts a novel model to hallucinate diversified and high-fidelity target
speaker phonemes based on only a short target speaker voice sample (e.g., 3 seconds).
The hallucinated phonemes are then exploited to perform neighbor-based voice
conversion. Our model is a text-free, any-to-any VC model that requires no text
annotations and supports conversion to any unseen speaker. Objective and
subjective evaluations show that \textit{Phoneme Hallucinator} outperforms
existing VC methods in both intelligibility and speaker similarity.
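
As a rough illustration of the neighbor-based conversion step mentioned above, the following minimal sketch replaces each source frame with the average of its nearest frames in a target feature set. It is a sketch under assumptions, not the authors' implementation: the abstract does not specify the feature extractor, the distance measure, or the number of neighbors, so the cosine similarity, the value `k=4`, the function name `knn_convert`, and the random arrays standing in for speech features are all hypothetical. In the described method, the target set would presumably hold features from the short target sample together with the hallucinated ones.

```python
import numpy as np

def knn_convert(source_feats, target_feats, k=4, eps=1e-8):
    """Map each source frame to the mean of its k most similar target frames.

    source_feats: (n_src, dim) array of source-speech frame features.
    target_feats: (n_tgt, dim) array of target-speaker frame features
                  (assumed here to include any hallucinated features).
    Cosine similarity and k=4 are illustrative assumptions.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    src = source_feats / (np.linalg.norm(source_feats, axis=1, keepdims=True) + eps)
    tgt = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + eps)
    sim = src @ tgt.T  # shape (n_src, n_tgt)

    # Never ask for more neighbours than there are target frames.
    k = min(k, target_feats.shape[0])
    # Unordered indices of the k most similar target frames per source frame.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    # Average the selected target frames: output has shape (n_src, dim).
    return target_feats[idx].mean(axis=1)

# Toy data standing in for real speech features (hypothetical shapes).
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 16))
target = rng.normal(size=(200, 16))
converted = knn_convert(source, target, k=4)
```

Because every output frame is built only from target-speaker features, a larger and more varied target set gives the matching step more candidates to choose from, which is the motivation the abstract gives for expanding the set from a short sample.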

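Summary: The paper proposes a novel method, Phoneme Hallucinator, which generates diverse, high-fidelity target phonemes from only a short audio sample of the target speaker, thereby balancing high fidelity and speaker similarity in voice conversion.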

Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

We present a novel way of conditioning a pretrained denoising diffusion
speech model to produce speech in the voice of a novel person unseen during
training. The method requires a short (~3 seconds) sample from the target
person, and generation is steered at inference time, without any training
steps. At the heart of the method lies a sampling process that combines the
estimation of the denoising model with a low-pass version of the new speaker's
sample. The objective and subjective evaluations show that our sampling method
can generate a voice similar to that of the target speaker in terms of
frequency, with an accuracy comparable to state-of-the-art methods, and without
training.
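
One way such a guided sampling step could look is sketched below. The abstract does not state the exact rule for combining the model estimate with the low-pass reference, so this sketch assumes a simple frequency-swap rule: keep the high-frequency part of the model's estimate and take the low-frequency part from a noised copy of the target speaker's sample. The moving-average filter, the names `low_pass` and `guided_step`, the filter width, and the 1-D random arrays standing in for the model's signal representation are all hypothetical.

```python
import numpy as np

def low_pass(x, width=9):
    """Crude low-pass filter: a moving average over `width` samples."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def guided_step(x_denoised, ref_noised, width=9):
    """Blend one denoising estimate with the reference sample.

    x_denoised: the model's estimate at the current diffusion step.
    ref_noised: the target speaker's sample, noised to the same step.
    The low-frequency band of the estimate is swapped for that of the
    reference; the high-frequency band of the estimate is kept.
    This combination rule is an assumption, not taken from the paper.
    """
    return x_denoised - low_pass(x_denoised, width) + low_pass(ref_noised, width)

# Toy arrays standing in for one step of a real sampler (hypothetical data).
rng = np.random.default_rng(0)
estimate = rng.normal(size=256)
reference = rng.normal(size=256)
guided = guided_step(estimate, reference)
```

In a full sampler this blend would be applied after every denoising step, which is consistent with the abstract's claim that steering happens at inference time and needs no training.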

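Summary: This paper proposes a new method that uses a short natural speech sample from a new target speaker to steer the sampling of a denoising diffusion speech model at inference time, generating audio in a voice similar to the target speaker's without any training steps.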