The multifaceted nature of human perception and comprehension indicates that,
when we think, we can naturally take any combination of senses, i.e.,
modalities, and form a coherent picture in our brain. For example, when we see
a cattery and simultaneously perceive the cat's purring sound, our brain can
construct a picture of a cat in the cattery. Intuitively, generative AI models
should hold the versatility of humans and be capable of generating images from
any combination of modalities efficiently and collaboratively. This paper
presents ImgAny, a novel end-to-end multi-modal generative model that can mimic
human reasoning and generate high-quality images. Our method is the first
attempt to efficiently and flexibly take any combination of seven modalities:
language, audio, and five vision modalities (image, point cloud, thermal,
depth, and event data). Our
key idea is inspired by human-level cognitive processes and involves the
integration and harmonization of multiple input modalities at both the entity
and attribute levels without specific tuning across modalities. Accordingly,
our method introduces two novel training-free technical branches: 1) the Entity
Fusion Branch ensures coherence between inputs and outputs. It extracts entity
features from the multi-modal representations using our specially
constructed entity knowledge graph; 2) the Attribute Fusion Branch
preserves and processes the attributes. It amalgamates distinct
attributes from diverse input modalities via our proposed attribute knowledge
graph. Lastly, the entity and attribute features are adaptively fused as the
conditional inputs to the pre-trained Stable Diffusion model for image
generation. Extensive experiments under diverse modality combinations
demonstrate its exceptional capability for visual content creation.
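
As a rough illustration of the two-branch idea, the toy sketch below looks up the nearest entity and attribute nodes for a set of modality embeddings and mixes them into one conditioning vector. Everything here is an assumption made for illustration: the abstract does not say how the knowledge graphs are queried or how the fusion weight is chosen, so the nearest-neighbour lookup, the fixed weight `alpha`, and all function names are hypothetical and are not ImgAny's implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def mean_embedding(embs):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(embs) for col in zip(*embs)]

def nearest_node(query, graph):
    """Return the (name, vector) of the graph node most similar to `query`."""
    return max(graph.items(), key=lambda kv: cosine(query, kv[1]))

def fuse(modality_embs, entity_graph, attribute_graph, alpha=0.5):
    """Toy entity/attribute fusion (hypothetical, training-free).

    modality_embs: modality name -> embedding in a shared space (assumed).
    entity_graph / attribute_graph: node name -> embedding (assumed layout).
    alpha: fixed mixing weight; the abstract describes an adaptive one.
    """
    # Entity branch: one entity that agrees with all inputs on average.
    query = mean_embedding([l2_normalize(e) for e in modality_embs.values()])
    entity_name, entity_vec = nearest_node(query, entity_graph)

    # Attribute branch: one attribute node per modality, then averaged.
    attr_vecs = [nearest_node(e, attribute_graph)[1]
                 for e in modality_embs.values()]
    attr_vec = mean_embedding(attr_vecs)

    # Mix both branches into a single unit-length conditioning vector.
    cond = [alpha * e + (1 - alpha) * a for e, a in zip(entity_vec, attr_vec)]
    return entity_name, l2_normalize(cond)
```

In a real pipeline the returned vector would stand in for the conditional input passed to the pre-trained diffusion model; that hand-off is not shown here.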

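ImgAny is a novel end-to-end multi-modal generative model that mimics human reasoning and generates high-quality images. It efficiently and flexibly accepts any combination of seven modalities spanning language, audio, and vision, integrates the input modalities through an Entity Fusion Branch and an Attribute Fusion Branch, and generates images with a pre-trained Stable Diffusion model. Extensive experiments demonstrate its strong capability for visual content creation.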

Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation

Collecting a multimodal dataset with two paired modalities A and B or B and C
is difficult in practice. Obtaining a dataset with three aligned modalities A,
B, and C is even more challenging. For example, some public medical datasets
have only genetic sequences and microscopic images for one patient, and only
genetic sequences and radiological images for another - but no dataset includes
both microscopic and radiological images for the same patient. This makes it
difficult to integrate and combine all modalities into a large pre-trained
neural network. We introduce LoReTTa (Linking mOdalities with a tRansitive and
commutativE pre-Training sTrAtegy) to address this understudied problem. Our
self-supervised framework combines causal masked modeling with the rules of
commutativity and transitivity to transition within and between different
modalities. Thus, it can model the relation A -> C with A -> B -> C. Given a
dataset containing only the disjoint combinations (A, B) and (B, C), we show
that a transformer pre-trained with LoReTTa can handle any modality combination
at inference time, including the never-seen pair (A, C) and the triplet (A, B,
C). We evaluate our approach on a multimodal dataset derived from MNIST
containing speech, vision, and language, as well as a real-world medical
dataset containing mRNA, miRNA, and RPPA samples from TCGA. Compared to
traditional pre-training methods, we observe up to a 100-point reduction in
perplexity for autoregressive generation tasks and up to a 15% improvement in
classification accuracy for modality pairs not seen during
pre-training.
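
One way to picture the commutativity and transitivity rules is as two ways of building training sequences from disjoint pairs. The sketch below is a minimal illustration under stated assumptions, not LoReTTa's actual code: the modality marker tokens such as `<A>`, the `generate` callable standing in for the model, and the choice of which side of the sequence to supervise are all hypothetical, since the abstract does not specify them.

```python
from itertools import permutations

def commutative_sequences(sample):
    """Return one token sequence per ordering of the modalities in `sample`.

    `sample` maps a modality name to its token list, e.g.
    {"A": [...], "B": [...]}. Training on every ordering lets a causal
    model predict A from B as well as B from A.
    """
    sequences = []
    for order in permutations(sorted(sample)):
        seq = []
        for name in order:
            seq.append(f"<{name}>")   # hypothetical modality marker token
            seq.extend(sample[name])
        sequences.append(seq)
    return sequences

def transitive_sequence(sample, source, bridge, target, generate):
    """Build a source -> target sequence for a pair never observed together.

    `generate(bridge_tokens, target)` is a stand-in for the model sampling
    the missing `target` modality from the `bridge` modality, i.e. the
    B -> C step in A -> B -> C.
    """
    pseudo_target = generate(sample[bridge], target)
    return [f"<{source}>", *sample[source], f"<{target}>", *pseudo_target]
```

Given an (A, B) sample and a model that has already learned B -> C from the (B, C) data, `transitive_sequence(sample, "A", "B", "C", generate)` yields an (A, C) sequence even though that pair never appears in the dataset.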

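LoReTTa is a self-supervised framework that learns transitive and commutative relations between modalities, so that modalities can be combined and integrated even when no dataset contains all three of them aligned. Compared to traditional pre-training methods, it achieves lower perplexity on generation tasks and higher classification accuracy on previously unseen modality combinations.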