Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models do not satisfy these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, which limits their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and an attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness across multiple important visual generation tasks: for example, it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task. Project page: https://lyne1.github.io/RealGeneral/
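
The role of an attention mask over a joint token sequence can be illustrated with a minimal NumPy sketch. This is not the RealGeneral implementation. The token ordering `[text | condition frame | target frame]`, the masking rule (text and condition tokens do not attend to each other, while target tokens attend to everything), and the function names `build_block_mask` and `masked_attention` are all assumptions made for illustration.

```python
import numpy as np


def build_block_mask(n_text, n_cond, n_target):
    """Return a boolean mask M where M[q, k] is True if query q may attend to key k.

    Assumed token order: [text | condition frame | target frame].
    Assumed rule (hypothetical, for illustration only): text and condition
    tokens are hidden from each other; target tokens see every token.
    """
    n = n_text + n_cond + n_target
    mask = np.ones((n, n), dtype=bool)
    text = slice(0, n_text)
    cond = slice(n_text, n_text + n_cond)
    mask[text, cond] = False  # text queries cannot read condition keys
    mask[cond, text] = False  # condition queries cannot read text keys
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # masked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = build_block_mask(n_text=2, n_cond=3, n_target=4)
    x = rng.normal(size=(9, 8))  # 9 tokens with 8 features each
    out, weights = masked_attention(x, x, x, mask)
    print(out.shape)
    print(weights[0, 2:5])  # weights from a text token to condition tokens
```

Every token can still attend to itself under this rule, so no softmax row is fully masked. In a real DiT block the same boolean mask would be applied per attention head, alongside whatever normalization the block uses.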

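This work addresses the challenge of unifying diverse image generation tasks within a single framework. We propose RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task and introduces a Unified Conditional Embedding module and a Unified Stream DiT Block to reduce cross-modal interference. Experimental results show that RealGeneral performs well on multiple important visual generation tasks and significantly improves subject similarity and image quality.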

RealGeneral: Unified Visual Generation via Video Models