Recent endeavors in video editing have showcased promising results in
single-attribute editing or style transfer tasks, either by training
text-to-video (T2V) models on text-video data or adopting training-free
methods. However, when confronted with the complexities of multi-attribute
editing scenarios, they exhibit shortcomings such as omitting or overlooking
intended attribute changes, modifying the wrong elements of the input video,
and failing to preserve regions of the input video that should remain intact.
To address this, here we present a novel grounding-guided video-to-video
translation framework called Ground-A-Video for multi-attribute video editing.
Ground-A-Video attains temporally consistent multi-attribute editing of input
videos in a training-free manner without the aforementioned shortcomings.
Central to our method is the introduction of Cross-Frame Gated Attention, which
incorporates grounding information into the latent representations in a
temporally consistent fashion, along with Modulated Cross-Attention and
optical-flow-guided smoothing of inverted latents. Extensive experiments and applications
demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline
methods in terms of edit accuracy and frame consistency. Further results and
code are provided at our project page (this http URL).
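
The abstract names Cross-Frame Gated Attention without giving its formula, so the following is only a minimal sketch of one plausible reading: every latent token attends over the tokens of all frames plus a set of grounding tokens, and the result is added back through a tanh-gated residual. The function names, the identity query/key/value projections, the single head, and the scalar `gate` are all assumptions made for illustration and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cross_frame_gated_attention(frames, grounding_tokens, gate):
    """Hypothetical gated cross-frame attention (illustrative only).

    frames: list of frames, each a list of token vectors.
    grounding_tokens: list of vectors with the same dimension as the tokens.
    gate: scalar; tanh(gate) scales the attention update (assumed learnable).
    """
    # Shared context: tokens from every frame plus the grounding tokens.
    context = [tok for frame in frames for tok in frame] + list(grounding_tokens)
    g = math.tanh(gate)
    output = []
    for frame in frames:
        new_frame = []
        for tok in frame:
            update = attend(tok, context, context)  # identity projections (assumption)
            new_frame.append([t + g * u for t, u in zip(tok, update)])
        output.append(new_frame)
    return output


# Toy data: two frames with one 2-d token each, and one grounding token.
frames = [[[1.0, 0.0]], [[0.0, 1.0]]]
grounding = [[1.0, 1.0]]
unchanged = cross_frame_gated_attention(frames, grounding, gate=0.0)
edited = cross_frame_gated_attention(frames, grounding, gate=1.0)
```

With `gate=0.0` the residual term vanishes and the latents pass through untouched. In this sketch, that is what would let a gated layer be added to a pretrained model without disturbing it at initialization. Because all frames share one attention context, each frame's update is computed from the same set of keys and values.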

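Ground-A-Video, a novel grounding-based framework for multi-attribute video editing, achieves training-free, temporally consistent multi-attribute editing by introducing Cross-Frame Gated Attention, Modulated Cross-Attention, and optical-flow-guided smoothing of inverted latents, and it outperforms other baseline methods in edit accuracy and frame consistency.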

Zero-shot video editing using text-to-image diffusion models

Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

Large language models readily adapt to novel settings, even without
task-specific training data. Can their zero-shot capacity be extended to
multimodal inputs? In this work, we propose ESPER, which extends language-only
zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to
language model generations without direct supervision: for example, in the
image case our reward optimization relies only on cosine similarity derived
from CLIP, and thus requires no additional explicitly paired (image, caption)
data. Because the parameters of the language model are left unchanged, the
model maintains its capacity for zero-shot generalization. Experiments
demonstrate that ESPER outperforms baselines and prior work on a variety of
zero-shot tasks; these include a new benchmark that we collect and release,
the ESP dataset, which tasks models with generating several diversely-styled
captions for each image.
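
To make the reward signal concrete, here is a minimal sketch of scoring candidate captions by cosine similarity to an image embedding. It assumes the embeddings have already been computed by a CLIP-style encoder; the short vectors below are hypothetical stand-ins rather than real CLIP outputs, and `caption_rewards` is an illustrative name, not part of ESPER's released code. The reinforcement learning update that would consume these rewards is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def caption_rewards(image_embedding, caption_embeddings):
    """Score each candidate caption by its similarity to the image."""
    return [cosine_similarity(image_embedding, c) for c in caption_embeddings]


# Hypothetical 3-d embeddings standing in for CLIP image/text features.
image = [1.0, 0.0, 0.0]
captions = [
    [2.0, 0.0, 0.0],   # same direction as the image embedding
    [0.0, 1.0, 0.0],   # orthogonal to it
    [-1.0, 0.0, 0.0],  # opposite direction
]
rewards = caption_rewards(image, captions)
```

The reward depends only on the image and the generated text, so no reference caption enters the computation. This matches the abstract's point that no additional paired (image, caption) data is required.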

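This paper proposes ESPER, a method that extends language-only zero-shot models to unseen multimodal tasks such as image and audio captioning. It uses reinforcement learning to align multimodal inputs with language model generations without direct supervision, and experiments show that it outperforms baselines and prior work, including on a new benchmark.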