We introduce MAGNeT, a masked generative sequence modeling method that
operates directly over several streams of audio tokens. Unlike prior work,
MAGNeT consists of a single-stage, non-autoregressive transformer. During
training, we predict spans of masked tokens obtained from a masking scheduler,
while during inference we gradually construct the output sequence using several
decoding steps. To further enhance the quality of the generated audio, we
introduce a novel rescoring method in which we leverage an external
pre-trained model to rescore and rank predictions from MAGNeT, which are
then used in later decoding steps. Lastly, we explore a hybrid version of
MAGNeT, in which we fuse autoregressive and non-autoregressive models
to generate the first few seconds in an autoregressive manner while the rest of
the sequence is decoded in parallel. We demonstrate the efficiency of
MAGNeT for the tasks of text-to-music and text-to-audio generation and conduct
an extensive empirical evaluation, considering both objective metrics and human
studies. The proposed approach is comparable to the evaluated baselines, while
being significantly faster (7x faster than the autoregressive baseline).
Through ablation studies and analysis, we shed light on the importance of each
of the components comprising MAGNeT and point to the trade-offs between
autoregressive and non-autoregressive modeling in terms of latency,
throughput, and generation quality. Samples are available on our demo page:
this https URL
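
The decoding procedure described in the abstract (start fully masked, predict
all masked positions in parallel, keep the most confident predictions, repeat
for a few steps) can be illustrated with a minimal sketch. Everything below is
an assumption made for illustration, not the authors' implementation: the
abstract does not specify the masking schedule (a cosine schedule is assumed
here), the sketch works on single tokens in one stream rather than spans over
several streams, `toy_model` is a random stand-in for the transformer, and the
weighted mix with an optional `rescorer` is only one plausible way to combine
an external model's score with the model's own confidence.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked position


def toy_model(tokens, vocab_size, rng):
    """Random stand-in for the transformer: one (token, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(length, vocab_size, steps, model=toy_model,
                     rescorer=None, w=0.5, seed=0):
    """Fill a fully masked sequence over a fixed number of parallel decoding steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = model(tokens, vocab_size, rng)

        # Score each masked position, optionally mixing in an external rescorer.
        scores = {}
        for i in masked:
            tok, conf = preds[i]
            if rescorer is not None:
                conf = w * rescorer(tokens, i, tok) + (1 - w) * conf
            scores[i] = (conf, tok)

        # Assumed cosine schedule: number of positions left masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)

        # Commit the highest-scoring predictions; the rest stay masked.
        ranked = sorted(masked, key=lambda i: scores[i][0], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = scores[i][1]
    return tokens


if __name__ == "__main__":
    print(iterative_decode(length=16, vocab_size=8, steps=4))
```

With `length=16` and `steps=4`, this schedule commits 2, 3, 5, and 6 positions
in successive steps, so the whole sequence is produced in 4 model calls
instead of the 16 a token-by-token autoregressive decoder would need.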

Masked Audio Generation using a Single Non-Autoregressive Transformer

Music generation has attracted growing interest with the advancement of deep
generative models. However, generating music conditioned on textual
descriptions, known as text-to-music, remains challenging due to the complexity
of musical structures and high sampling rate requirements. Despite the task's
significance, prevailing generative models exhibit limitations in music
quality, computational efficiency, and generalization. This paper introduces
JEN-1, a universal high-fidelity model for text-to-music generation. JEN-1 is a
diffusion model incorporating both autoregressive and non-autoregressive
training. Through in-context learning, JEN-1 performs various generation tasks
including text-guided music generation, music inpainting, and continuation.
Evaluations demonstrate JEN-1's superior performance over state-of-the-art
methods in text-music alignment and music quality while maintaining
computational efficiency. Our demos are available at
this http URL

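JEN-1 is a universal high-fidelity model for text-to-music generation. It
combines autoregressive and non-autoregressive training and uses in-context
learning to perform text-guided music generation, music inpainting, and
continuation. Compared with prior methods, it shows superior performance in
text-music alignment, music quality, and computational efficiency.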