Text-to-video (T2V) generation has attracted significant attention, yet unifying discrete and continuous grounding conditions in T2V generation remains under-explored. This paper proposes a Grounded text-to-Video generation framework, termed GVDIFF. First, we inject the grounding condition into self-attention through an uncertainty-based representation to explicitly guide the focus of the network. Second, we introduce a spatial-temporal grounding layer that connects the grounding condition with target objects and equips the model with grounded generation capacity in the spatial-temporal domain. Third, our dynamic gate network adaptively skips redundant grounding steps, selectively extracting grounding information and semantics while improving efficiency. We extensively evaluate the grounded generation capacity of GVDIFF and demonstrate its versatility in applications including long-range video generation, sequential prompts, and object-specific editing.

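This work proposes GVDIFF, a grounded text-to-video generation framework. It injects the grounding condition into the self-attention mechanism to explicitly guide the network's focus; introduces a spatial-temporal grounding layer that connects the grounding condition with target objects, giving the model grounded generation capability in the spatial-temporal domain; and uses a dynamic gate network that adaptively skips redundant grounding steps, selectively extracting grounding information and semantics while improving efficiency. The grounded generation capability of GVDIFF is evaluated extensively, and its versatility is demonstrated in applications such as long-range video generation, sequential prompts, and object-specific editing.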
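
The idea of a gate that skips redundant grounding steps can be illustrated with a minimal sketch. This is not the GVDIFF implementation: the source does not specify the gate's architecture, so the mean pooling, the sigmoid score, the fixed threshold, the residual fusion, and every name below (`gate_score`, `gated_grounding`, `toy_grounding_layer`) are assumptions chosen only to show how a layer might be bypassed when a gate judges it unnecessary.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Tokens = List[Vector]


def gate_score(features: Tokens, weights: Vector, bias: float) -> float:
    """Hypothetical gate: sigmoid of a linear map over mean-pooled token features."""
    dim = len(weights)
    pooled = [sum(tok[d] for tok in features) / len(features) for d in range(dim)]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def gated_grounding(
    features: Tokens,
    grounding_layer: Callable[[Tokens], Tokens],
    weights: Vector,
    bias: float,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[Tokens, bool]:
    """Apply grounding_layer as a residual update only when the gate opens.

    Returns the (possibly unchanged) features and a flag telling whether
    the grounding layer was actually executed.
    """
    if gate_score(features, weights, bias) < threshold:
        # Gate closed: skip the grounding computation entirely.
        return features, False
    update = grounding_layer(features)
    fused = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(features, update)]
    return fused, True


def toy_grounding_layer(features: Tokens) -> Tokens:
    """Stand-in for a real grounding layer: a constant update of 1.0 per channel."""
    return [[1.0 for _ in tok] for tok in features]


tokens = [[0.0, 2.0], [2.0, 0.0]]
# Positive gate weights give a score above the threshold, so grounding runs.
kept, applied_open = gated_grounding(tokens, toy_grounding_layer, [1.0, 1.0], 0.0)
# Negative gate weights give a score below the threshold, so grounding is skipped.
skipped, applied_closed = gated_grounding(tokens, toy_grounding_layer, [-1.0, -1.0], 0.0)
```

In this toy setup the efficiency gain comes from never calling `grounding_layer` when the gate is closed. A trained model would presumably learn the gate parameters rather than take them as fixed inputs, and a hard threshold like this one would need a differentiable relaxation during training.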

GVDIFF: Diffusion-Based Text-to-Video Generation