3D-consistent image generation from a single 2D semantic label is an
important and challenging research topic in computer graphics and computer
vision. Although some related works have made great progress in this field,
most existing methods disentangle shape and appearance poorly and lack
multi-modal control. In this paper, we propose a novel end-to-end 3D-aware
image generation and editing model that accepts multiple types of conditional
inputs, including pure noise, text, and reference images. On the one hand, we
dive into the latent space of 3D Generative
Adversarial Networks (GANs) and propose a novel disentanglement strategy to
separate appearance features from shape features during the generation process.
On the other hand, we propose a unified framework for flexible image generation
and editing tasks with multi-modal conditions. Our method can generate diverse
images from different noise inputs, edit attributes through a text description,
and perform style transfer given a reference RGB image. Extensive experiments
demonstrate that the proposed method outperforms alternative approaches both
qualitatively and quantitatively on image generation and editing.

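This paper proposes a novel end-to-end 3D-aware image generation and editing model. Taking multiple conditional inputs such as pure noise, text, and reference images, it explores the latent space of 3D Generative Adversarial Networks (GANs) and proposes a generation strategy with better disentanglement, while adopting a unified framework for flexible image generation and editing tasks, enabling diverse image generation, attribute editing, and style transfer under multi-modal conditions. Extensive experiments show that the method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.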

3D-Aware Image Generation and Editing under Multi-modal Conditions

3D-aware Image Generation and Editing with Multi-modal Conditions

Existing music-driven 3D dance generation methods mainly concentrate on
high-quality dance generation, but lack sufficient control during the
generation process. To address this issue, we propose a unified framework
capable of generating high-quality dance movements and supporting multi-modal
control, including genre control, semantic control, and spatial control. First,
we decouple the dance generation network from the dance control network,
thereby avoiding the degradation in dance quality that arises when adding
control information. Second, we design specific control strategies for
different control information and integrate them into a unified framework.
Experimental results show that the proposed dance generation framework
outperforms state-of-the-art methods in terms of motion quality and
controllability.

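We propose a unified framework capable of generating high-quality dance movements and supporting multi-modal control, including genre control, semantic control, and spatial control. Experiments show that the proposed dance generation framework outperforms existing methods in motion quality and controllability.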

An Exploration of Multi-modal Control in Music-Driven Dance Generation

Exploring Multi-Modal Control in Music-Driven Dance Generation

Text-conditional diffusion models can generate high-fidelity images with
diverse content. However, text descriptions are often ambiguous about the
intended image, so additional control signals are needed to make text-guided
diffusion models more effective. In this work, we propose Cocktail, a pipeline
that mixes various modalities into one embedding, combined with a generalized
ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a
spatial guidance sampling method, to achieve multi-modal and spatially-refined
control for text-conditional diffusion models. Specifically, we introduce a
hyper-network, gControlNet, dedicated to aligning control signals from
disparate modalities and infusing them into the pre-trained diffusion model.
gControlNet accepts flexible modality signals: it can receive any combination
of modality signals simultaneously, or additionally fuse multiple modality
signals. The control signals are then fused and injected into the backbone
model according to our proposed ControlNorm. Furthermore, our spatial guidance
sampling method incorporates the control signal into the designated region,
thereby avoiding undesired objects in the generated image. We demonstrate the
results of our method in controlling various modalities, showing high-quality
synthesis and fidelity to multiple external signals.
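
As a rough illustration of how control signals from an arbitrary subset of
modalities might be fused and injected through a normalisation layer, the
Python sketch below sums whichever modality features are present and uses the
result to scale and shift normalised backbone features. The abstract does not
specify how ControlNorm is computed, so the conditional scale-and-shift
modulation, the summation used for fusion, the modality names, and the
identifiers `fuse_controls`, `control_norm`, `scale_w`, and `shift_w` are all
assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, List, Optional


def fuse_controls(signals: Dict[str, Optional[List[float]]], size: int) -> List[float]:
    """Sum the feature vectors of whichever modalities are present.

    A value of None means that modality was not supplied and is skipped,
    so any combination of modalities can be passed in (assumed fusion rule).
    """
    fused = [0.0] * size
    for feat in signals.values():
        if feat is None:
            continue
        if len(feat) != size:
            raise ValueError("every control feature must have length `size`")
        for i, value in enumerate(feat):
            fused[i] += value
    return fused


def control_norm(features: List[float], fused: List[float],
                 scale_w: float = 1.0, shift_w: float = 1.0,
                 eps: float = 1e-5) -> List[float]:
    """Normalise backbone features, then modulate them with the fused control.

    scale_w and shift_w stand in for learned projections (hypothetical);
    where the fused control is zero, the output is the plain normalised value.
    """
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    std = math.sqrt(var + eps)
    return [((x - mean) / std) * (1.0 + scale_w * c) + shift_w * c
            for x, c in zip(features, fused)]


# Toy usage: two modalities supplied, one omitted (names are hypothetical).
backbone = [1.0, 2.0, 3.0, 4.0]
controls = {
    "sketch": [0.0, 0.0, 1.0, 1.0],
    "pose": None,
    "segmentation": [0.0, 0.5, 0.0, 0.5],
}
fused = fuse_controls(controls, size=len(backbone))
modulated = control_norm(backbone, fused)
plain = control_norm(backbone, [0.0] * len(backbone))
```

In this toy setup, positions where no control is active pass through as
ordinary normalised features, which is one simple way to keep the control
confined to the region where a signal is present.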

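Multi-modal and spatially refined control of text-conditional diffusion models is achieved through multi-modality mixing, an improved controllable normalisation, and a spatial guidance sampling method, producing high-quality synthesized images.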

Cocktail: Combining Multi-modal Controls for Text-Based Image Generation

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Diffusion models have recently emerged as a powerful generative tool. Despite the
great progress, existing diffusion models mainly focus on uni-modal control,
i.e., the diffusion process is driven by only one modality of condition. To
further unleash the users' creativity, it is desirable for the model to be
controllable by multiple modalities simultaneously, e.g., generating and
editing faces by describing the age (text-driven) while drawing the face shape
(mask-driven). In this work, we present Collaborative Diffusion, where
pre-trained uni-modal diffusion models collaborate to achieve multi-modal face
generation and editing without re-training. Our key insight is that diffusion
models driven by different modalities are inherently complementary in their
latent denoising steps, upon which bilateral connections can be established.
Specifically, we propose the dynamic diffuser, a meta-network that adaptively
hallucinates multi-modal denoising steps by predicting the spatial-temporal
influence functions for each pre-trained uni-modal model. Collaborative
Diffusion not only combines the generation capabilities of uni-modal
diffusion models, but also integrates multiple uni-modal manipulations to
perform multi-modal editing. Extensive qualitative and quantitative experiments
demonstrate the superiority of our framework in both image quality and
condition consistency.
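
To make the idea of influence functions concrete, the Python sketch below
combines the noise predictions of several uni-modal models at one denoising
step using per-position weights. It assumes that the meta-network has already
produced one influence logit per model and per position for the current
timestep, and that the logits are normalised with a softmax across models; the
softmax, the flat-list data layout, and the names `softmax_across_models` and
`collaborative_noise` are assumptions for illustration and are not taken from
the abstract.

```python
import math
from typing import List, Sequence


def softmax_across_models(logits: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per-position softmax over the model axis.

    logits[m][i] is the (hypothetical) influence logit of model m at
    position i; the returned weights sum to 1 over m for every i.
    """
    num_models = len(logits)
    num_positions = len(logits[0])
    weights = [[0.0] * num_positions for _ in range(num_models)]
    for i in range(num_positions):
        column = [logits[m][i] for m in range(num_models)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for m in range(num_models):
            weights[m][i] = exps[m] / total
    return weights


def collaborative_noise(noise_preds: Sequence[Sequence[float]],
                        influence_logits: Sequence[Sequence[float]]) -> List[float]:
    """Blend per-model noise predictions with their influence weights."""
    weights = softmax_across_models(influence_logits)
    num_models = len(noise_preds)
    num_positions = len(noise_preds[0])
    return [
        sum(weights[m][i] * noise_preds[m][i] for m in range(num_models))
        for i in range(num_positions)
    ]


# Toy step: a text-driven and a mask-driven model on four flattened positions.
text_eps = [1.0, 1.0, 1.0, 1.0]
mask_eps = [-1.0, -1.0, -1.0, -1.0]
logits = [
    [0.0, 0.0, 5.0, -5.0],   # text-driven model
    [0.0, 0.0, -5.0, 5.0],   # mask-driven model
]
weights = softmax_across_models(logits)
mixed = collaborative_noise([text_eps, mask_eps], logits)
```

In this toy step, positions with equal logits average the two predictions,
while positions with a large logit gap follow one model almost entirely. In a
full sampler the logits would be recomputed at every timestep, which is what
would make the influence both spatial and temporal.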

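This paper proposes a model called Collaborative Diffusion which, without re-training, extends existing uni-modal diffusion models with multiple individual modalities to achieve multi-modal face generation and editing.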