This paper presents VoiceLDM, a model designed to produce audio that
accurately follows two distinct natural language text prompts: the description
prompt and the content prompt. The former provides information about the
overall environmental context of the audio, while the latter conveys the
linguistic content. To achieve this, we adopt a text-to-audio (TTA) model based
on latent diffusion models and extend its functionality to incorporate an
additional content prompt as a conditional input. By leveraging pretrained
contrastive language-audio pretraining (CLAP) and Whisper models, VoiceLDM is
trained on large amounts of real-world audio without manual annotations or
transcriptions. Additionally, we employ dual classifier-free guidance to
further enhance the controllability of VoiceLDM. Experimental results
demonstrate that VoiceLDM is capable of generating plausible audio that aligns
well with both input conditions, even surpassing the speech intelligibility of
the ground truth audio on the AudioCaps test set. Furthermore, we explore the
text-to-speech (TTS) and zero-shot text-to-audio capabilities of VoiceLDM and
show that it achieves competitive results. Demos and code are available at
this https URL
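
The dual classifier-free guidance mentioned above can be illustrated with a minimal sketch. The abstract does not state the guidance formula, so the code assumes one common two-condition form: the fully conditioned noise prediction is shifted along two directions, one per prompt, each measured against the fully unconditional prediction. The function name `dual_cfg`, its argument names, and the weights `w_desc` and `w_cont` are hypothetical and chosen for illustration only.

```python
def dual_cfg(eps_full, eps_desc, eps_cont, eps_null, w_desc, w_cont):
    """Combine four noise predictions into one guided prediction.

    Assumed formulation (not taken from the abstract):
        eps_full + w_desc * (eps_desc - eps_null)
                 + w_cont * (eps_cont - eps_null)

    eps_full: prediction given both the description and content prompts
    eps_desc: prediction given only the description prompt
    eps_cont: prediction given only the content prompt
    eps_null: prediction given neither prompt
    w_desc, w_cont: guidance weights for each condition (hypothetical names)

    Only +, -, and * are used, so the inputs may be plain floats,
    NumPy arrays, or tensors of matching shape.
    """
    desc_direction = eps_desc - eps_null   # pull toward the description prompt
    cont_direction = eps_cont - eps_null   # pull toward the content prompt
    return eps_full + w_desc * desc_direction + w_cont * cont_direction


# Toy usage with scalars standing in for noise predictions.
guided = dual_cfg(eps_full=1.0, eps_desc=0.5, eps_cont=2.0, eps_null=0.0,
                  w_desc=2.0, w_cont=1.0)
```

Under this assumed form, setting both weights to zero recovers the plain conditional prediction, and raising one weight strengthens adherence to the corresponding prompt independently of the other.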

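VoiceLDM is a text-to-audio model based on latent diffusion that combines a description prompt and a content prompt to generate realistic audio aligned with both input conditions. On the AudioCaps test set, its speech intelligibility even surpasses that of the ground truth audio. The paper also explores VoiceLDM's text-to-speech and zero-shot text-to-audio capabilities.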