In this paper, we propose a novel score-based generative model for unconditional raw audio synthesis. Our proposal builds upon the latest developments in diffusion process modeling with stochastic differential equations, which have already demonstrated promising results on image generation. We motivate novel heuristics for choosing diffusion processes better suited to audio generation, and consider the use of a conditional U-Net to approximate the score function. While previous diffusion-based approaches to audio were mainly designed as speech vocoders at medium resolution, our method, termed CRASH (Controllable Raw Audio Synthesis with High-resolution), allows us to generate short percussive sounds at 44.1 kHz in a controllable way. Through extensive experiments on a drum sound generation task, we showcase the numerous sampling schemes offered by our method (unconditional generation, deterministic generation, inpainting, interpolation, variations, class-conditional sampling) and propose class-mixing sampling, a novel way to generate "hybrid" sounds. Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities with lighter and easier-to-train models.

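This paper proposes a score-based generative model that synthesizes audio using diffusion process modeling and a conditional U-Net to approximate the score function. The method can controllably generate short percussive sounds at a high resolution of 44.1 kHz and supports multiple sampling schemes, including class-conditional sampling and hybrid sound generation. Compared with other GAN-based methods, the model is lightweight and easy to train.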

CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis