Diffusion models have recently been shown to be relevant for high-quality
speech generation. Most work has focused on generating spectrograms and
therefore requires a subsequent model (i.e., a vocoder) to convert the
spectrogram to a waveform. This work proposes a diffusion probabilistic
end-to-end model for generating a raw speech waveform. The proposed model is
autoregressive, generating overlapping frames sequentially, where each frame is
conditioned on a portion of the previously generated one. Hence, our model can
effectively synthesize speech of unlimited duration while preserving
high-fidelity synthesis and temporal coherence. We implemented the proposed
model for unconditional and conditional speech generation, where the latter can
be driven by an input sequence of phonemes, amplitudes, and pitch values.
Working on the waveform directly has some empirical advantages. Specifically,
it allows the creation of local acoustic behaviors, such as vocal fry, which make
the overall waveform sound more natural. Furthermore, the proposed diffusion
model is stochastic rather than deterministic; therefore, each inference generates
a slightly different waveform variation, enabling an abundance of valid
realizations. Experiments show that the proposed model generates speech with
superior quality compared with other state-of-the-art neural speech generation
systems.
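
The frame-by-frame generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frame and overlap lengths, and the toy "denoising" update are all assumptions made for illustration, and the overlap is simply pinned to the previous frame's tail, whereas a trained model would take it as a conditioning input.

```python
import numpy as np

def generate_frame(rng, frame_len, overlap, prev_tail=None, steps=10):
    """Toy stand-in for the reverse diffusion process over one frame.

    Starts from Gaussian noise and applies a placeholder update; a real
    model would call a trained denoising network at each step.
    """
    x = rng.standard_normal(frame_len)
    for _ in range(steps):
        # Placeholder update (assumption): shrink the signal and re-inject a little noise.
        x = 0.8 * x + 0.05 * rng.standard_normal(frame_len)
        if prev_tail is not None:
            # Keep the overlapping region equal to the end of the previous frame.
            x[:overlap] = prev_tail
    return x

def generate_waveform(num_frames, frame_len=500, overlap=100, seed=0):
    """Generate overlapping frames sequentially and stitch them together."""
    rng = np.random.default_rng(seed)
    pieces = []
    prev_tail = None
    for _ in range(num_frames):
        frame = generate_frame(rng, frame_len, overlap, prev_tail)
        # After the first frame, only the non-overlapping part is new audio.
        pieces.append(frame if prev_tail is None else frame[overlap:])
        prev_tail = frame[-overlap:].copy()
    return np.concatenate(pieces)
```

Because each frame depends only on the tail of the previous one, the loop can run for any number of frames, and changing the seed yields a different realization, which mirrors the stochastic behavior the abstract describes.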

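This paper proposes a diffusion-based probabilistic end-to-end model for generating raw speech waveforms. The model generates overlapping frames sequentially in an autoregressive manner, which allows it to synthesize speech of unlimited duration while preserving high fidelity and temporal coherence. Working directly on the waveform is advantageous because it allows local acoustic behaviors to be created. The model is also stochastic, so it generates slightly different waveform variations. Experimental results show that the proposed model achieves higher synthesis quality than other state-of-the-art neural speech generation systems.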

DiffAR: A Denoising Diffusion Autoregressive Model for Generating Raw Speech Waveforms

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Recent large-scale text-driven synthesis models have attracted much attention
thanks to their remarkable capabilities of generating highly diverse images
that follow given text prompts. Such text-based synthesis methods are
particularly appealing to humans who are used to verbally describing their
intent. Therefore, it is only natural to extend the text-driven image synthesis
to text-driven image editing. Editing is challenging for these generative
models, since an innate property of an editing technique is to preserve most of
the original image, while in the text-based models, even a small modification
of the text prompt often leads to a completely different outcome.
State-of-the-art methods mitigate this by requiring the users to provide a
spatial mask to localize the edit, thereby ignoring the original structure and
content within the masked region. In this paper, we pursue an intuitive
prompt-to-prompt editing framework, where the edits are controlled by text
only. To this end, we analyze a text-conditioned model in depth and observe
that the cross-attention layers are the key to controlling the relation between
the spatial layout of the image and each word in the prompt. With this
observation, we present several applications that control the image synthesis
by editing the textual prompt only. This includes localized editing by
replacing a word, global editing by adding a specification, and even delicately
controlling the extent to which a word is reflected in the image. We present
our results over diverse images and prompts, demonstrating high-quality
synthesis and fidelity to the edited prompts.
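
The role of cross-attention in word replacement can be sketched as follows. This is a simplified illustration, not the paper's code: the `cross_attention` function, its `injected_maps` parameter, and the random toy tensors are assumptions, and a single set of pixel queries is shared between both passes, whereas in a real diffusion model the queries come from the evolving image features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values, injected_maps=None):
    """queries: (pixels, d); keys: (tokens, d); values: (tokens, d_v).

    Returns the attended output and the (pixels, tokens) attention maps.
    If `injected_maps` is given (hypothetical parameter), it replaces the
    maps computed from the current prompt.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    maps = softmax(scores, axis=-1)
    if injected_maps is not None:
        maps = injected_maps
    return maps @ values, maps

rng = np.random.default_rng(0)
pixels, tokens, d = 16, 4, 8
queries = rng.standard_normal((pixels, d))

# Source prompt: one key/value vector per token.
src_keys = rng.standard_normal((tokens, d))
src_vals = rng.standard_normal((tokens, d))

# Edited prompt: token 2 is replaced by a different word.
edit_keys, edit_vals = src_keys.copy(), src_vals.copy()
edit_keys[2] = rng.standard_normal(d)
edit_vals[2] = rng.standard_normal(d)

# Pass 1 records the source maps; pass 2 reuses them with the edited values.
_, src_maps = cross_attention(queries, src_keys, src_vals)
edited_out, used_maps = cross_attention(
    queries, edit_keys, edit_vals, injected_maps=src_maps
)
```

Reusing the source maps keeps the pixel-to-word layout fixed while the replaced token contributes new content, which is the intuition behind editing by changing the prompt alone.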

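This paper proposes a text-based image editing framework that uses the cross-attention layers to control the relation between image layout and text, enabling global and local edits without changing the original content and thereby achieving high-quality image synthesis.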

Prompt-to-Prompt Image Editing via Cross-Attention Control

Prompt-to-Prompt Image Editing with Cross Attention Control

We propose Neural Actor (NA), a new method for high-quality synthesis of
humans from arbitrary viewpoints and under arbitrary controllable poses. Our
method is built upon recent neural scene representation and rendering works
which learn representations of geometry and appearance from only 2D images.
While existing works demonstrated compelling rendering of static scenes and
playback of dynamic scenes, photo-realistic reconstruction and rendering of
humans with neural implicit methods, in particular under user-controlled novel
poses, is still difficult. To address this problem, we utilize a coarse body
model as the proxy to unwarp the surrounding 3D space into a canonical pose. A
neural radiance field learns pose-dependent geometric deformations and pose-
and view-dependent appearance effects in the canonical space from multi-view
video input. To synthesize novel views of high-fidelity dynamic geometry and
appearance, we leverage 2D texture maps defined on the body model as latent
variables for predicting residual deformations and the dynamic appearance.
Experiments demonstrate that our method achieves better quality than the
state of the art on playback as well as novel pose synthesis, and can even
generalize well to new poses that starkly differ from the training poses.
Furthermore, our method also supports body shape control of the synthesized
results.
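
One common way to unwarp posed space into a canonical pose with a body model is inverse linear blend skinning. The sketch below assumes that formulation, which the abstract does not specify; the function names, the per-point skinning weights, and the bone transforms are hypothetical inputs that would in practice be taken from the body model (for example, from the nearest point on its surface).

```python
import numpy as np

def blend_transforms(weights, bone_transforms):
    """weights: (N, B); bone_transforms: (B, 4, 4), canonical -> posed.

    Returns one blended 4x4 transform per point, shape (N, 4, 4).
    """
    return np.einsum("nb,bij->nij", weights, bone_transforms)

def skin_to_posed(canonical_points, weights, bone_transforms):
    """Forward linear blend skinning: canonical space -> posed space."""
    blended = blend_transforms(weights, bone_transforms)
    ones = np.ones((len(canonical_points), 1))
    homog = np.concatenate([canonical_points, ones], axis=1)  # (N, 4)
    return np.einsum("nij,nj->ni", blended, homog)[:, :3]

def unwarp_to_canonical(posed_points, weights, bone_transforms):
    """Inverse skinning: posed space -> canonical space.

    Solves blended @ x = p for each point; assumes every blended
    transform is invertible.
    """
    blended = blend_transforms(weights, bone_transforms)
    ones = np.ones((len(posed_points), 1))
    homog = np.concatenate([posed_points, ones], axis=1)  # (N, 4)
    canonical = np.linalg.solve(blended, homog[..., None])[..., 0]
    return canonical[:, :3]
```

A radiance field queried at the unwarped points then only has to model the residual, pose-dependent effects in canonical space rather than the full articulated motion.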

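This paper proposes Neural Actor, a new method for synthesizing high-quality images of humans from arbitrary viewpoints and under arbitrary controllable poses. Building on recent neural scene representation and rendering work, it uses a coarse body model to map the surrounding 3D space into a canonical pose, and learns pose-dependent geometric deformations and pose- and view-dependent appearance effects from multi-view video input in order to predict residual deformations and dynamic appearance. It also supports body shape control of the synthesized results.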