Image generation has advanced significantly in recent years, yet existing models remain limited in perceiving and generating an arbitrary number of interrelated images within a broad context. This limitation becomes increasingly critical as demand for multi-image scenarios, such as multi-view images and visual narratives, grows with the expansion of multimedia platforms. This paper introduces a domain-general framework for many-to-many image generation that produces a series of interrelated images from a given set of images, offering a scalable approach that removes the need for task-specific solutions across different multi-image scenarios. To facilitate this, we present MIS, a novel large-scale multi-image dataset containing 12M synthetic multi-image samples, each with 25 interconnected images. Using Stable Diffusion with varied latent noise, our method produces a set of interconnected images from a single caption. Leveraging MIS, we learn M2M, an autoregressive model for many-to-many generation in which each image is modeled within a diffusion framework. Trained on the synthetic MIS, the model excels at capturing style and content from preceding images (synthetic or real) and generates novel images that follow the captured patterns. Furthermore, through task-specific fine-tuning, our model demonstrates its adaptability to various multi-image generation tasks, including Novel View Synthesis and Visual Procedure Generation.

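This paper introduces a domain-general framework for many-to-many image generation that can produce a series of interrelated images from a given set of images, offering a scalable approach that removes the need for task-specific solutions across different multi-image scenarios. For the MIS dataset, the method uses Stable Diffusion with varied latent noise to generate a set of interrelated images from a single caption. Through training on the MIS dataset, the model captures the style and content of preceding images (synthetic or real) and generates new images that follow those patterns. Furthermore, through task-specific fine-tuning, the model demonstrates its ability to adapt to various multi-image generation tasks, including novel view synthesis and visual procedure generation.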

Many-to-Many Image Generation with Auto-Regressive Diffusion Models