The multifaceted nature of human perception and comprehension indicates that,
when we think, our body can naturally take any combination of senses (i.e.,
modalities) and form a beautiful picture in our brain. For example, when we see
a cattery and simultaneously perceive the cat's purring sound, our brain can
construct a picture of a cat in the cattery. Intuitively, generative AI models
should hold the versatility of humans and be capable of generating images from
any combination of modalities efficiently and collaboratively. This paper
presents ImgAny, a novel end-to-end multi-modal generative model that can mimic
human reasoning and generate high-quality images. Our method is the first
attempt to efficiently and flexibly take any combination of seven modalities,
ranging from language and audio to vision modalities, including image, point
cloud, thermal, depth, and event data. Our
key idea is inspired by human-level cognitive processes and involves the
integration and harmonization of multiple input modalities at both the entity
and attribute levels without specific tuning across modalities. Accordingly,
our method introduces two novel training-free technical branches: 1) the Entity
Fusion Branch ensures coherence between inputs and outputs. It extracts entity
features from the multi-modal representations, powered by our specially
constructed entity knowledge graph; 2) the Attribute Fusion Branch preserves
and processes the attributes. It efficiently combines distinct attributes from
diverse input modalities via our proposed attribute knowledge graph. Lastly,
the entity and attribute features are adaptively fused as the
conditional inputs to the pre-trained Stable Diffusion model for image
generation. Extensive experiments under diverse modality combinations
demonstrate its exceptional capability for visual content creation.
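
The two-branch idea described above can be illustrated with a toy sketch. This is not the ImgAny implementation: the shared embedding space, the node vocabularies standing in for the entity and attribute knowledge graphs, the nearest-neighbour retrieval, and the fixed fusion weight `alpha` are all assumptions made for illustration. It only shows how per-modality embeddings could be pooled, matched against entity and attribute nodes, and blended into one conditioning vector without any training.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def retrieve(query, nodes):
    """Return (name, embedding) of the node most similar to the query."""
    name = max(nodes, key=lambda key: cosine(query, nodes[key]))
    return name, nodes[name]


def fuse(entity_vec, attribute_vec, alpha=0.5):
    """Blend entity and attribute features with weight alpha (hypothetical)."""
    return [alpha * e + (1 - alpha) * a for e, a in zip(entity_vec, attribute_vec)]


# Toy embeddings; every value here is made up for illustration.
modal_embeddings = {"audio": [0.9, 0.1, 0.0], "depth": [0.8, 0.3, 0.1]}
entity_nodes = {"cat": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
attribute_nodes = {"indoor": [0.5, 0.5, 0.0], "snowy": [0.0, 0.0, 1.0]}

# Pool the input modalities, then query both (hypothetical) graphs.
query = mean_vector(list(modal_embeddings.values()))
entity_name, entity_vec = retrieve(query, entity_nodes)
attribute_name, attribute_vec = retrieve(query, attribute_nodes)

# The fused vector would play the role of the conditional input.
condition = fuse(entity_vec, attribute_vec, alpha=0.7)
print(entity_name, attribute_name, condition)
```

In a real system the fused vector would replace or augment the text conditioning of a pre-trained diffusion model; how ImgAny sets the fusion weights adaptively is not specified in the abstract.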

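ImgAny is a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. The method efficiently and flexibly accepts any combination of seven modalities spanning language, audio, and vision, integrates the input modalities through an Entity Fusion Branch and an Attribute Fusion Branch, and uses the pre-trained Stable Diffusion model to generate images. Extensive experiments demonstrate its exceptional capability for visual content creation.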

Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation

Recent years have seen remarkable progress in deep-learning-powered visual
content creation. This includes 3D-aware generative image synthesis, which
produces high-fidelity images in a 3D-consistent manner while simultaneously
capturing compact surfaces of objects from pure image collections without the
need for any 3D supervision, thus bridging the gap between 2D imagery and 3D
reality. The 3D-aware generative models have shown that the introduction of 3D
information can lead to more controllable image generation. The task of
3D-aware image synthesis has taken the field of computer vision by storm, with
hundreds of papers accepted to top-tier journals and conferences in recent
years (mainly the past two years), but a comprehensive survey of this
remarkable and swift progress is still lacking. Our survey aims to introduce new researchers to
this topic, provide a useful reference for related works, and stimulate future
research directions through our discussion section. Apart from the presented
papers, we aim to constantly update the latest relevant papers along with
corresponding implementations at
this https URL.

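This survey reviews the remarkable recent progress in deep-learning-powered visual content creation, including 3D-aware generative image synthesis. It aims to provide a comprehensive overview of 3D-aware image synthesis, offer a useful reference for related work, and stimulate future research directions through its discussion section.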

A Survey on 3D-aware Image Synthesis

Visual content creation has attracted soaring interest given its applications
in mobile photography and AR/VR. Style transfer and single-image 3D
photography, two representative tasks, have so far evolved independently. In
this paper, we make a connection between the two, and address the challenging
task of 3D photo stylization: generating stylized novel views from a single
image given an arbitrary style. Our key intuition is that style transfer and
view synthesis have to be jointly modeled for this task. To this end, we
propose a deep model that learns geometry-aware content features for
stylization from a point cloud representation of the scene, resulting in
high-quality stylized images that are consistent across views. Further, we
introduce a novel training protocol that enables learning from only 2D
images. We demonstrate the superiority of our method via extensive qualitative
and quantitative studies, and showcase key applications of our method in light
of the growing demand for 3D content creation from 2D image assets.
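
To make the point cloud representation mentioned above more concrete, here is a minimal sketch of one common way to lift a single image into 3D. The abstract does not say how the point cloud is built; this sketch assumes a per-pixel depth map and a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. It back-projects pixels to 3D points, then re-projects a point into a camera translated by `tx` along the x-axis, which is the basic operation behind rendering a novel view.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to 3D points under a pinhole model.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def project(point, fx, fy, cx, cy, tx=0.0):
    """Project a 3D point into a camera shifted by tx along the x-axis."""
    x, y, z = point
    x_cam = x - tx  # express the point in the shifted camera's frame
    return (fx * x_cam / z + cx, fy * y / z + cy)


# Toy 2x2 depth map; the zero entry marks a missing depth value.
depth = [[2.0, 2.0], [0.0, 4.0]]
fx = fy = 1.0
cx = cy = 0.5

points = backproject(depth, fx, fy, cx, cy)
same_view = project(points[0], fx, fy, cx, cy)           # original camera
novel_view = project(points[0], fx, fy, cx, cy, tx=1.0)  # shifted camera
print(points, same_view, novel_view)
```

Projecting with `tx=0.0` maps each point back to its source pixel, while a non-zero `tx` shifts nearer points more than farther ones. This parallax is why per-point features, rather than per-pixel ones, can keep stylization consistent across views.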

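This paper proposes a deep model that learns geometry-aware content features from a point cloud representation of the scene to generate high-quality stylized images that are consistent across views. It thereby enables 3D photo stylization from a single image given an arbitrary style, and demonstrates the superiority of the method in qualitative and quantitative studies.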