In this paper, we propose MakeSinger, a semi-supervised training method for
singing voice synthesis (SVS) via classifier-free diffusion guidance. The
challenge in SVS lies in the costly process of gathering aligned sets of text,
pitch, and audio data. MakeSinger enables the training of the diffusion-based
SVS model from any speech and singing voice data regardless of its labeling,
thereby enhancing the quality of generated voices with a large amount of
unlabeled data. At inference, our novel dual guiding mechanism gives text and
pitch guidance on the reverse diffusion step by estimating the score of masked
input. Experimental results show that the model trained in a semi-supervised
manner outperforms other baselines trained only on the labeled data in terms of
pronunciation, pitch accuracy and overall quality. Furthermore, we demonstrate
that by adding Text-to-Speech (TTS) data in training, the model can synthesize
the singing voices of TTS speakers even without their singing voices.
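
The dual guiding mechanism can be illustrated with a minimal sketch. The abstract does not give the exact guidance formula, so the code below assumes one common way of combining two conditions in classifier-free guidance: the model is queried with both conditions masked, with only text, and with only pitch, and the two differences from the unconditional prediction are scaled and added back. The names `dual_guided_prediction`, `toy_model`, `w_text` and `w_pitch` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical model signature: (noisy input, text or None, pitch or None) -> prediction.
# Passing None stands in for a masked (dropped) condition.
ModelFn = Callable[[Vector, Optional[Sequence[int]], Optional[Sequence[float]]], Vector]


def dual_guided_prediction(
    model: ModelFn,
    x_t: Vector,
    text: Sequence[int],
    pitch: Sequence[float],
    w_text: float,
    w_pitch: float,
) -> Vector:
    """Combine unconditional, text-only and pitch-only predictions.

    Assumed form (not confirmed by the abstract):
        uncond + w_text * (text_only - uncond) + w_pitch * (pitch_only - uncond)
    """
    uncond = model(x_t, None, None)       # both conditions masked
    text_only = model(x_t, text, None)    # pitch masked
    pitch_only = model(x_t, None, pitch)  # text masked
    return [
        u + w_text * (t - u) + w_pitch * (p - u)
        for u, t, p in zip(uncond, text_only, pitch_only)
    ]


def toy_model(x_t, text, pitch):
    """Stand-in for a trained network: shifts the input depending on which conditions are present."""
    shift = (1.0 if text is not None else 0.0) + (2.0 if pitch is not None else 0.0)
    return [v + shift for v in x_t]


if __name__ == "__main__":
    print(dual_guided_prediction(toy_model, [0.0, 1.0], [1, 2], [440.0], 2.0, 0.5))
```

With both weights set to zero the sketch falls back to the unconditional prediction, which is the behavior a model trained partly on unlabeled data would be expected to support.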

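MakeSinger, a semi-supervised training method based on classifier-free diffusion guidance, improves the quality of synthesized singing voices and shows that a model trained with text-to-speech (TTS) data can synthesize the singing voices of TTS speakers even without singing data from them.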

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Large Language Models (LLMs) are one of the most promising technologies for
the next era of speech generation systems, due to their scalability and
in-context learning capabilities. Nevertheless, they suffer from multiple
stability issues at inference time, such as hallucinations, content skipping or
speech repetitions. In this work, we introduce a new self-supervised Voice
Conversion (VC) architecture which can be used to learn to encode transitory
features, such as content, separately from stationary ones, such as speaker ID
or recording conditions, creating speaker-disentangled representations. Using
speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the
LLM to generate the content and the style of the speech only from the text,
similarly to humans, while the speaker identity is provided by the decoder of
the VC model. Results show that LLMs trained over speaker-disentangled
self-supervised representations provide an improvement of 4.7pp in speaker
similarity over SOTA entangled representations, and a word error rate (WER)
5.4pp lower. Furthermore, they achieve higher naturalness than human recordings
of the LibriTTS test-other dataset. Finally, we show that using an explicit
reference embedding negatively impacts intelligibility (stability), with WER
increasing by 14pp compared to the model that only uses text to infer the
style.
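
For readers unfamiliar with the metric, the sketch below shows how a word error rate and a percentage-point (pp) gap between two systems are typically computed. The abstract does not describe the evaluation pipeline (for example, which speech recognizer transcribes the generated audio), so this is only the standard word-level edit-distance definition; the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def percentage_point_gap(rate_a: float, rate_b: float) -> float:
    """Difference between two rates given as fractions, expressed in percentage points."""
    return (rate_a - rate_b) * 100.0


if __name__ == "__main__":
    baseline = word_error_rate("a b c d", "a x c d")
    improved = word_error_rate("a b c d", "a b c d")
    print(percentage_point_gap(baseline, improved))
```

A gap reported in pp is an absolute difference between two rates, not a relative reduction.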

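In this work, we introduce a new self-supervised voice conversion (VC) architecture that learns to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, creating speaker-disentangled representations. Results show that large language models (LLMs) trained on speaker-disentangled self-supervised representations improve speaker similarity by 4.7 percentage points over state-of-the-art entangled representations and lower the word error rate (WER) by 5.4 percentage points. They also achieve higher naturalness than the human recordings of the LibriTTS test-other dataset. Finally, we show that using an explicit reference embedding harms intelligibility (stability), with WER increasing by 14 percentage points compared to the model that infers the style from text alone.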

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Creating realistic and natural-sounding synthetic speech remains a big
challenge for voice identities unseen during training. As there is growing
interest in synthesizing voices of new speakers, here we investigate the
ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC)
modes to extrapolate from speakers observed during training to create unseen
speaker identities. Firstly, we create an approach for TTS and VC, and then we
comprehensively evaluate our methods and baselines in terms of intelligibility,
naturalness, speaker similarity, and ability to create new voices. We use both
objective and subjective metrics to benchmark our techniques on two evaluation
tasks: zero-shot and new voice speech synthesis. The goal of the former task is
to measure the precision of the conversion to an unseen voice. The goal of the
latter is to measure the ability to create new voices. Extensive evaluations
demonstrate that the proposed approach systematically achieves
state-of-the-art performance in zero-shot speech synthesis and creates various
new voices, unobserved in the training set. We consider this work to be the
first attempt to synthesize new voices based on mel-spectrograms and
normalizing flows, along with a comprehensive analysis and comparison of the
TTS and VC modes.

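In this study of synthesizing realistic, natural-sounding speech for voice identities unseen during training using normalizing flows, we create an approach for text-to-speech (TTS) and voice conversion (VC) and evaluate it with objective and subjective metrics on zero-shot and new-voice speech synthesis tasks. Experiments show that the approach achieves state-of-the-art performance in zero-shot speech synthesis and creates a variety of new voices that do not appear in the training set.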