Recent advances in diffusion-based generative modeling have led to the
development of text-to-video (T2V) models that can generate high-quality videos
conditioned on a text prompt. Most of these T2V models produce
single-scene video clips that depict an entity performing a particular action
(e.g., `a red panda climbing a tree'). However, generating multi-scene videos
matters because they are ubiquitous in the real world (e.g., `a red
panda climbing a tree' followed by `the red panda sleeps on the top of the
tree'). To generate multi-scene videos from a pretrained T2V model, we
introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the
text-conditioning mechanism in the T2V architecture to recognize the temporal
alignment between the video scenes and scene descriptions. For instance, we
condition the visual features of the earlier and later scenes of the generated
video with the representations of the first scene description (e.g., `a red
panda climbing a tree') and second scene description (e.g., `the red panda
sleeps on the top of the tree'), respectively. As a result, we show that the
T2V model can generate multi-scene videos that adhere to the multi-scene text
descriptions and are visually consistent (e.g., in entity and background). Further,
we finetune the pretrained T2V model with multi-scene video-text data using the
TALC framework. We show that the TALC-finetuned model outperforms the baseline
methods by 15.5 points in the overall score, which averages visual consistency
and text adherence using human evaluation. The project website is
this https URL
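
The time-aligned conditioning described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `talc_cross_attention`, the single-head attention without learned query/key/value projections, and the explicit `frame_to_scene` index array are all assumptions made for illustration. The only point it shows is that the visual features of each frame attend solely to the token embeddings of the scene description aligned with that frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talc_cross_attention(frame_feats, scene_text_embs, frame_to_scene):
    """Hypothetical time-aligned cross-attention (illustrative only).

    frame_feats:     (F, P, D) visual features, F frames with P patches each.
    scene_text_embs: list of (T_i, D) arrays, one per scene description.
    frame_to_scene:  length-F integer array mapping each frame to its scene.
    """
    out = np.empty_like(frame_feats)
    d = frame_feats.shape[-1]
    for f, scene in enumerate(frame_to_scene):
        text = scene_text_embs[scene]             # (T, D), used as keys and values
        q = frame_feats[f]                        # (P, D), used as queries
        attn = softmax(q @ text.T / np.sqrt(d))   # (P, T) attention weights
        out[f] = attn @ text                      # text-conditioned features
    return out

# Toy usage: 4 frames, the first two aligned with caption 0, the last two with caption 1.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 3, 8))
captions = [rng.normal(size=(5, 8)), rng.normal(size=(6, 8))]
frame_to_scene = np.array([0, 0, 1, 1])
out = talc_cross_attention(frames, captions, frame_to_scene)
```

In this sketch, editing the second caption changes only the features of the later frames, which is the temporal alignment the abstract describes. A real T2V model would add learned projections, multiple heads, and temporal layers that share information across scenes.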

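We introduce the Time-Aligned Captions (TALC) framework, which enhances the text-conditioning mechanism so that text-to-video (T2V) models can generate multi-scene videos that adhere to multi-scene text descriptions and are visually consistent. By finetuning a pretrained T2V model with the TALC framework, we show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which combines visual consistency and text adherence.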

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to
image generation that integrates the concept of fixed point solving into the
framework of diffusion-based generative modeling. Our approach embeds an
implicit fixed point solving layer into the denoising network of a diffusion
model, transforming the diffusion process into a sequence of closely related
fixed point problems. Combined with a new stochastic training method, this
approach significantly reduces model size, reduces memory usage, and
accelerates training. Moreover, it enables the development of two new
techniques to improve sampling efficiency: reallocating computation across
timesteps and reusing fixed point solutions between timesteps. We conduct
extensive experiments with state-of-the-art models on ImageNet, FFHQ,
CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in
performance and efficiency. Compared to the state-of-the-art DiT model, FPDM
contains 87% fewer parameters, consumes 60% less memory during training, and
improves image generation quality in situations where sampling computation or
time is limited. Our code and pretrained models are available at
this https URL
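
The idea of reusing fixed point solutions between timesteps can be sketched with a toy example. Everything here is an assumption for illustration rather than the FPDM code: the layer `z = tanh(W z + x + t)`, the helper names `solve_fixed_point` and `make_layer`, and the use of plain forward iteration as the solver. The sketch compares solving each timestep's fixed point problem from a zero initialization against warm-starting from the previous timestep's solution.

```python
import numpy as np

def solve_fixed_point(f, z0, tol=1e-6, max_iter=100):
    """Iterate z <- f(z) until the update norm drops below tol.

    Returns the approximate fixed point and the number of iterations used.
    """
    z = z0
    for i in range(1, max_iter + 1):
        z_next = f(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iter

rng = np.random.default_rng(0)
dim = 16
W = rng.normal(size=(dim, dim))
W *= 0.5 / np.linalg.norm(W, 2)   # spectral norm 0.5, so the map below is a contraction
x = rng.normal(size=dim)          # stand-in for the noisy input features

def make_layer(t):
    # Hypothetical implicit layer for timestep t (t acts as a scalar embedding).
    return lambda z: np.tanh(W @ z + x + t)

timesteps = np.linspace(1.0, 0.0, 50)
cold_iters, warm_iters = 0, 0
z_prev = np.zeros(dim)
for t in timesteps:
    f = make_layer(t)
    _, n_cold = solve_fixed_point(f, np.zeros(dim))   # solve from scratch
    z_prev, n_warm = solve_fixed_point(f, z_prev)     # reuse the previous solution
    cold_iters += n_cold
    warm_iters += n_warm

print("iterations without reuse:", cold_iters)
print("iterations with reuse:   ", warm_iters)
```

Because neighboring timesteps define closely related fixed point problems, the previous solution is usually a good initial guess, so the warm-started solver tends to need fewer iterations in total.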

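We introduce the Fixed Point Diffusion Model (FPDM), a novel approach that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. By embedding an implicit fixed point solving layer into the denoising network of a diffusion model, our approach transforms the diffusion process into a sequence of closely related fixed point problems. Combined with a new stochastic training method, this approach significantly reduces model size, reduces memory usage, and accelerates training. It also enables two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments on ImageNet, FFHQ, CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model, FPDM contains 87% fewer parameters, consumes 60% less memory during training, and improves image generation quality when sampling computation or time is limited. Our code and pretrained models are available at this https URL.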

Fixed Point Diffusion Models

We investigate the approximation efficiency of score functions by deep neural
networks in diffusion-based generative modeling. While existing approximation
theories utilize the smoothness of score functions, they suffer from the curse
of dimensionality for intrinsically high-dimensional data. This limitation is
pronounced in graphical models such as Markov random fields, common for image
distributions, where the approximation efficiency of score functions remains
unestablished.
To address this, we observe that score functions can often be well-approximated in
graphical models through variational inference denoising algorithms.
Furthermore, these algorithms are amenable to efficient neural network
representation. We demonstrate this in examples of graphical models, including
Ising models, conditional Ising models, restricted Boltzmann machines, and
sparse encoding models. Combined with off-the-shelf discretization error bounds
for diffusion-based sampling, we provide an efficient sample complexity bound
for diffusion-based generative modeling when the score function is learned by
deep neural networks.
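
As a rough illustration of approximating a score function by a variational denoising algorithm, the sketch below uses naive mean-field iteration on a small Ising model. The specific setup is an assumption, not taken from the paper: spins x in {-1, +1}^d with prior proportional to exp(0.5 x^T J x), a noisy observation y = alpha * x + sigma * g with standard Gaussian g, and the hypothetical function names `mean_field_score` and `exact_score`. Under this setup the posterior over x given y is again an Ising model with external field alpha * y / sigma^2, and Tweedie's formula gives the score of y as (alpha * E[x | y] - y) / sigma^2, so an approximate posterior mean yields an approximate score.

```python
import itertools
import numpy as np

def mean_field_score(y, J, alpha, sigma, n_iter=200):
    """Approximate score of y using naive mean-field for E[x | y] (illustrative)."""
    h = alpha * y / sigma**2            # external field of the posterior over x
    m = np.tanh(h)                      # initial guess for the posterior mean
    for _ in range(n_iter):
        m = np.tanh(J @ m + h)          # naive mean-field update
    return (alpha * m - y) / sigma**2   # Tweedie's formula with m in place of E[x | y]

def exact_score(y, J, alpha, sigma):
    """Brute-force reference that enumerates all 2^d spin configurations."""
    d = len(y)
    h = alpha * y / sigma**2
    configs = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    log_w = 0.5 * np.einsum("ni,ij,nj->n", configs, J, configs) + configs @ h
    w = np.exp(log_w - log_w.max())     # unnormalized posterior weights
    post_mean = (w[:, None] * configs).sum(axis=0) / w.sum()
    return (alpha * post_mean - y) / sigma**2

# Toy usage: 4 spins with weak, uniform couplings and a zero diagonal.
d = 4
J = 0.05 * (np.ones((d, d)) - np.eye(d))
rng = np.random.default_rng(0)
y = rng.normal(size=d)
approx = mean_field_score(y, J, alpha=0.6, sigma=0.8)
exact = exact_score(y, J, alpha=0.6, sigma=0.8)
print("max abs error:", np.abs(approx - exact).max())
```

Each mean-field update is a matrix-vector product followed by a tanh, which has the same form as a neural network layer; this is the sense in which such iterative algorithms lend themselves to efficient network representations. The brute-force reference is only feasible for very small d and is included purely as a check.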

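We study how efficiently deep neural networks can approximate score functions in diffusion-based generative modeling. We observe that score functions in graphical models can be well approximated by variational inference denoising algorithms, and that these algorithms are amenable to efficient neural network representation. We verify this observation on examples and, combined with discretization error bounds, provide an efficient sample complexity bound for diffusion-based generative modeling.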