To address the bottleneck of accurately interpreting user intent within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate that our system significantly improves the controllability and video quality of existing video generation models across various aspects. Project Page: https://sqwu.top/Any2Cap/

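This work targets the bottleneck of interpreting user intent in video generation and proposes the Any2Caption framework for controllable video generation under any condition. The framework uses modern multimodal large language models to interpret diverse inputs (such as text, images, videos, and specialized cues) into structured captions, giving video generators better guidance. Evaluations show that the system significantly improves controllability and video quality.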

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation