Recent advancements in video generation have been remarkable, yet many
existing methods struggle with issues of consistency and poor text-video
alignment. Moreover, the field lacks effective techniques for text-guided video
inpainting, a stark contrast to the well-explored domain of text-guided image
inpainting. To this end, this paper proposes a novel text-guided video
inpainting model that achieves better consistency, controllability and
compatibility. Specifically, we introduce a simple but efficient motion capture
module to preserve motion consistency, design an instance-aware region
selection in place of random region selection to obtain better textual
controllability, and use a novel strategy to inject personalized models into
our CoCoCo model for better model compatibility.
Extensive experiments show that our model can generate high-quality video
clips. Meanwhile, our model shows better motion consistency, textual
controllability and model compatibility. More details are shown in
[cococozibojia.github.io](cococozibojia.github.io).
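
The contrast between random and instance-aware region selection can be illustrated with a minimal sketch. The abstract does not describe the actual procedure, so everything below is an assumption made for illustration: the function names, the rectangular random mask, and the `instance_map` input (a hypothetical per-pixel grid of instance IDs, with 0 as background, such as a segmentation model might produce).

```python
import random


def random_region_mask(height, width, rng):
    """Baseline: mask a random axis-aligned rectangle, ignoring image content."""
    top = rng.randrange(height)
    left = rng.randrange(width)
    bottom = rng.randrange(top, height) + 1  # exclusive bound, at least one row
    right = rng.randrange(left, width) + 1   # exclusive bound, at least one column
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)]
            for r in range(height)]


def pick_instance(instance_map, rng):
    """Choose one non-background instance ID at random, or None if there is none."""
    ids = sorted({pixel for row in instance_map for pixel in row if pixel != 0})
    return rng.choice(ids) if ids else None


def instance_aware_mask(instance_map, instance_id):
    """Mask exactly the pixels that belong to the chosen instance (assumed input format)."""
    return [[1 if pixel == instance_id else 0 for pixel in row]
            for row in instance_map]


# Toy 3x4 frame with two hypothetical instances (IDs 1 and 2).
instance_map = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 2, 2, 0],
]
rng = random.Random(0)
chosen = pick_instance(instance_map, rng)
mask = instance_aware_mask(instance_map, chosen)
```

In this sketch the instance-aware mask always covers a whole object, so a text prompt paired with it describes the masked content directly, whereas a random rectangle may cover only background or part of an object.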

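This paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability, and compatibility. Experiments show that the model can generate high-quality video clips with better motion consistency, textual controllability, and model compatibility.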

CoCoCo: Improving Text-Guided Video Inpainting to Enhance Consistency, Controllability, and Compatibility

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Significant advancements have been achieved in the realm of large-scale
pre-trained text-to-video Diffusion Models (VDMs). However, previous methods
either rely solely on pixel-based VDMs, which come with high computational
costs, or on latent-based VDMs, which often struggle with precise text-video
alignment. In this paper, we are the first to propose a hybrid model, dubbed
Show-1, which marries pixel-based and latent-based VDMs for text-to-video
generation. Our model first uses pixel-based VDMs to produce a low-resolution
video with strong text-video correlation. After that, we propose a novel expert
translation method that employs the latent-based VDMs to further upsample the
low-resolution video to high resolution. Compared to latent VDMs, Show-1 can
produce high-quality videos with precise text-video alignment; compared to pixel
VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G
vs 72G). We also validate our model on standard video generation benchmarks.
Our code and model weights are publicly available at
https://github.com/showlab/Show-1.

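This paper proposes a hybrid model, named Show-1, that combines pixel-based and latent-based text-to-video diffusion models to achieve precise text-video alignment and high-quality video generation.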

Fusing Pixel and Latent Diffusion Models for Text-to-Video Generation

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

Sequential video understanding, an emerging video understanding task, has
attracted considerable research attention because of its goal-oriented nature.
This paper studies weakly supervised sequential video understanding, where
accurate timestamp-level text-video alignment is not provided. We solve this
task by borrowing ideas from CLIP. Specifically, we use a transformer to
aggregate frame-level features for video representation and use a pre-trained
text encoder to encode the texts corresponding to each action and the whole
video, respectively. To model the correspondence between text and video, we
propose a multiple granularity loss, where the video-paragraph contrastive loss
enforces matching between the whole video and the complete script, and a
fine-grained frame-sentence contrastive loss enforces the matching between each
action and its description. As the frame-sentence correspondence is not
available, we propose to use the fact that video actions happen sequentially in
the temporal domain to generate pseudo frame-sentence correspondence and
supervise the network training with the pseudo labels. Extensive experiments on
video sequence verification and text-to-video matching show that our method
outperforms baselines by a large margin, which validates the effectiveness of
our proposed approach. Code is available at this https URL
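
The two loss terms and the pseudo-labeling idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how pseudo labels are derived from temporal order, so the uniform split in `pseudo_alignment` is a stand-in heuristic; the function names, the InfoNCE-style formulation, and the temperature of 0.07 are likewise assumptions, and embeddings are assumed to be L2-normalized lists of floats.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def pseudo_alignment(num_frames, num_sentences):
    """Assumed heuristic: split frames into equal, temporally ordered segments,
    one segment per sentence, and use the segment index as the pseudo label."""
    return [i * num_sentences // num_frames for i in range(num_frames)]


def video_paragraph_loss(video_embs, paragraph_embs, temperature=0.07):
    """Coarse term: symmetric contrastive loss where video i matches paragraph i."""
    n = len(video_embs)
    sims = [[dot(v, p) / temperature for p in paragraph_embs] for v in video_embs]
    v2p = sum(cross_entropy(sims[i], i) for i in range(n))
    p2v = sum(cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n))
    return (v2p + p2v) / (2 * n)


def frame_sentence_loss(frame_embs, sentence_embs, temperature=0.07):
    """Fine-grained term: each frame should match its pseudo-labeled sentence."""
    labels = pseudo_alignment(len(frame_embs), len(sentence_embs))
    losses = [
        cross_entropy([dot(f, s) / temperature for s in sentence_embs], y)
        for f, y in zip(frame_embs, labels)
    ]
    return sum(losses) / len(losses)


# Toy example: four frames of one video, two action sentences in temporal order.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sentences = [[1.0, 0.0], [0.0, 1.0]]
fine_loss = frame_sentence_loss(frames, sentences)
```

In this toy example the first two frames are pseudo-labeled with the first sentence and the last two with the second, so the fine-grained loss is small when the embeddings agree with the temporal order and large when the sentences are swapped. A total objective would combine the two terms, for example as a weighted sum, though the weighting is not given in the abstract.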

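This paper proposes a transformer-based method for weakly supervised video understanding, whose main components include a multiple-granularity loss and pseudo frame-sentence correspondences. It performs well in experiments on video sequence verification and text-to-video matching.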