The multifaceted nature of human perception and comprehension indicates that,
when we think, our body can naturally take any combination of senses, a.k.a.
modalities, and form a beautiful picture in our brain. For example, when we see
a cattery and simultaneously perceive the cat's purring sound, our brain can
construct a picture of a cat in the cattery. Intuitively, generative AI models
should hold the versatility of humans and be capable of generating images from
any combination of modalities efficiently and collaboratively. This paper
presents ImgAny, a novel end-to-end multi-modal generative model that can mimic
human reasoning and generate high-quality images. Our method is the first
attempt to efficiently and flexibly take any combination of seven modalities,
ranging from language and audio to vision modalities, including image, point
cloud, thermal, depth, and event data. Our
key idea is inspired by human-level cognitive processes and involves the
integration and harmonization of multiple input modalities at both the entity
and attribute levels without specific tuning across modalities. Accordingly,
our method brings two novel training-free technical branches: 1) the Entity
Fusion Branch ensures coherence between inputs and outputs. It extracts entity
features from the multi-modal representations powered by our specially
constructed entity knowledge graph; 2) the Attribute Fusion Branch adeptly
preserves and processes the attributes. It efficiently amalgamates distinct
attributes from diverse input modalities via our proposed attribute knowledge
graph. Lastly, the entity and attribute features are adaptively fused as the
conditional inputs to the pre-trained Stable Diffusion model for image
generation. Extensive experiments under diverse modality combinations
demonstrate its exceptional capability for visual content creation.
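
The two-branch, training-free fusion described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the helper names (`cosine`, `nearest_node`, `fuse`), the representation of each knowledge graph as a plain dictionary of node embeddings, nearest-neighbour lookup by cosine similarity, and the fixed mixing weight `alpha` standing in for the adaptive fusion. In the actual system the fused feature would condition a pre-trained Stable Diffusion model, which is omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_node(query, nodes):
    """Return the (name, embedding) pair most similar to the query."""
    return max(nodes.items(), key=lambda kv: cosine(query, kv[1]))

def fuse(modal_embeddings, entity_nodes, attribute_nodes, alpha=0.5):
    """Hypothetical fusion: one shared entity, one attribute per modality."""
    n = len(modal_embeddings)
    dim = len(modal_embeddings[0])
    # Entity branch (assumed): all modalities vote through their mean embedding.
    mean = [sum(e[i] for e in modal_embeddings) / n for i in range(dim)]
    entity_name, entity_vec = nearest_node(mean, entity_nodes)
    # Attribute branch (assumed): each modality contributes its own nearest attribute.
    hits = [nearest_node(e, attribute_nodes) for e in modal_embeddings]
    attr_vec = [sum(vec[i] for _, vec in hits) / n for i in range(dim)]
    # Fixed-weight mix as a stand-in for the paper's adaptive fusion.
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(entity_vec, attr_vec)]
    return entity_name, [name for name, _ in hits], fused

# Toy graphs and inputs in a shared 3-dimensional embedding space.
entity_nodes = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
attribute_nodes = {"indoor": [0.0, 0.0, 1.0], "fluffy": [0.7, 0.0, 0.7]}
inputs = [[0.9, 0.1, 0.2], [0.1, 0.0, 0.9]]  # e.g. an audio and an image embedding

entity, attributes, condition = fuse(inputs, entity_nodes, attribute_nodes)
print(entity, attributes, condition)
```

No component is trained in this sketch: the graphs are only queried, which is the sense in which such a pipeline can be training-free.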

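ImgAny is a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. The method efficiently and flexibly accepts any combination of seven modalities spanning language, audio, and vision, integrates the input modalities through an Entity Fusion Branch and an Attribute Fusion Branch, and generates images with the pre-trained Stable Diffusion model. Extensive experiments demonstrate its strong capability for visual content creation.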

Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation

Creating and editing the shape and color of 3D objects require tremendous
human effort and expertise. Compared to direct manipulation in 3D interfaces,
2D interactions such as sketches and scribbles are usually much more natural
and intuitive for the users. In this paper, we propose a generic multi-modal
generative model that couples the 2D modalities and implicit 3D representations
through shared latent spaces. With the proposed model, versatile 3D generation
and manipulation are enabled by simply propagating the editing from a specific
2D controlling modality through the latent spaces. Examples include editing the
3D shape by drawing a sketch, re-colorizing the 3D surface by painting color
scribbles on the 2D rendering, and generating 3D shapes of a certain category
given one or a few reference images. Unlike prior works, our model does not
require re-training or fine-tuning per editing task and is also conceptually
simple, easy to implement, robust to input domain shifts, and flexible enough
to produce diverse reconstructions from partial 2D inputs. We evaluate our
framework on two representative 2D modalities, grayscale line sketches and
rendered color images, and demonstrate that our method enables various shape
manipulation and generation tasks with these 2D modalities.
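
Propagating a 2D edit through a shared latent space can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the decoders are hypothetical linear maps (`decoder_2d`, `decoder_3d`), the edit is propagated by gradient descent on the latent code, and the helper names `matvec` and `fit_latent` are invented for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def fit_latent(decoder_2d, target_2d, z, lr=0.1, steps=200):
    """Minimise ||decoder_2d @ z - target_2d||^2 over z by gradient descent."""
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(decoder_2d, z), target_2d)]
        # Gradient of the squared error with respect to z: 2 * A^T * residual.
        grad = [
            2 * sum(decoder_2d[i][j] * residual[i] for i in range(len(residual)))
            for j in range(len(z))
        ]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z

# Hypothetical decoders that share one 2-dimensional latent code.
decoder_2d = [[1.0, 0.0], [0.0, 1.0]]              # latent -> sketch features
decoder_3d = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # latent -> shape features

edited_sketch = [0.5, -1.0]                # the user's 2D edit
z = fit_latent(decoder_2d, edited_sketch, z=[0.0, 0.0])
edited_shape = matvec(decoder_3d, z)       # the edit carried over to 3D
print(z, edited_shape)
```

Only the latent code is updated; the decoders stay fixed, which is consistent with the claim that no re-training or fine-tuning is needed per editing task.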

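The paper proposes a generic multi-modal generative model that couples 2D modalities and implicit 3D representations through shared latent spaces, enabling versatile 3D generation and manipulation by simply propagating edits from a specific 2D controlling modality.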