Text-guided image generation has witnessed unprecedented progress due to the
development of diffusion models. Beyond text and image, sound is a vital
element within the sphere of human perception, offering vivid representations
and naturally coinciding with corresponding scenes. Taking advantage of sound
therefore presents a promising avenue for exploration within image generation
research. However, the relationship between audio and image supervision remains
underexplored, and the scarcity of related, high-quality
datasets poses a further obstacle. In this paper, we propose a unified
framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation,
editing, and stylization. In particular, our method adapts input sound into a
sound token, like an ordinary word, which can plug and play with existing
powerful diffusion-based Text-to-Image (T2I) models. Specifically, we first
train a multi-modal encoder to align audio representation with the pre-trained
textual manifold and visual manifold, respectively. Then, we propose the audio
adapter to adapt audio representation into an audio token enriched with
specific semantics, which can be injected into a frozen T2I model flexibly. In
this way, we are able to extract the dynamic information of varied sounds,
while utilizing the formidable capability of existing T2I models to facilitate
sound-guided image generation, editing, and stylization in a convenient and
cost-effective manner. Experimental results confirm that our proposed AAI
outperforms other state-of-the-art text- and sound-guided methods. Our
aligned multi-modal encoder is also competitive with other approaches on
audio-visual retrieval and audio-text retrieval tasks.
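
The adapt-and-inject idea described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the linear form of the adapter, the `<sound>` placeholder, the toy dimensions, and the names `adapt_audio` and `inject_sound_token` do not come from the paper. The sketch only shows how an adapted audio embedding could occupy a word slot in the token-embedding sequence that conditions a frozen T2I model.

```python
import random

def adapt_audio(audio_embedding, weight, bias):
    """Hypothetical linear adapter: project an audio embedding
    into the token-embedding space of the text encoder."""
    return [
        sum(w * x for w, x in zip(row, audio_embedding)) + b
        for row, b in zip(weight, bias)
    ]

def inject_sound_token(prompt_tokens, token_embeddings, sound_token,
                       placeholder="<sound>"):
    """Build the conditioning sequence, using the sound token wherever
    the (assumed) placeholder appears and ordinary word embeddings elsewhere."""
    return [
        sound_token if tok == placeholder else token_embeddings[tok]
        for tok in prompt_tokens
    ]

random.seed(0)
AUDIO_DIM, TOKEN_DIM = 4, 3  # toy sizes, not the real model dimensions

# Toy adapter parameters; in practice these would be learned.
weight = [[random.uniform(-1, 1) for _ in range(AUDIO_DIM)]
          for _ in range(TOKEN_DIM)]
bias = [0.0] * TOKEN_DIM

# Toy frozen word embeddings standing in for the T2I text encoder's table.
vocab = {w: [random.uniform(-1, 1) for _ in range(TOKEN_DIM)]
         for w in ["a", "photo", "of"]}

audio_embedding = [0.2, -0.1, 0.7, 0.4]  # stand-in for the aligned encoder output
sound_token = adapt_audio(audio_embedding, weight, bias)

prompt = ["a", "photo", "of", "<sound>"]
conditioning = inject_sound_token(prompt, vocab, sound_token)
```

Because only the adapter parameters would be trained in such a setup, the word embeddings and the diffusion model stay frozen, which is consistent with the plug-and-play claim in the abstract.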

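This paper proposes a unified framework, Align, Adapt, and Inject (AAI), for sound-guided image generation, editing, and stylization. The method converts the input sound into a sound token and leverages existing powerful diffusion-based T2I models, enabling convenient and cost-effective sound-guided image generation, editing, and stylization. Experiments show that AAI outperforms other state-of-the-art text- and sound-guided methods.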

Align, Adapt and Inject: Sound-guided Unified Image Generation

We propose a visual-linguistic representation learning approach within a
self-supervised learning framework by introducing a new operation, loss, and
data augmentation strategy. First, we generate diverse features for the
image-text matching (ITM) task by soft-masking the image regions most
relevant to a given word in the corresponding caption, instead of
completely removing them. Since our framework relies only on image-caption
pairs with no fine-grained annotations, we identify the regions relevant to
each word by computing the word-conditional visual attention using a multi-modal
encoder. Second, we encourage the model to focus more on hard but diverse
examples by proposing a focal loss for the image-text contrastive learning
(ITC) objective, which alleviates overfitting and
bias issues. Last, we perform multi-modal data augmentations for
self-supervised learning via mining various examples by masking texts and
rendering distortions on images. We show that the combination of these three
innovations is effective for learning a pretrained model, leading to
outstanding performance on multiple vision-language downstream tasks.
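
The soft-masking operation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the min-max normalization of the attention scores, the `strength` cap, and the function name `soft_mask` are all hypothetical. The point it shows is that highly attended regions are attenuated rather than zeroed out.

```python
def soft_mask(region_features, attention, strength=0.5):
    """Attenuate each region's features in proportion to its
    word-conditional attention score instead of removing the region.

    region_features: list of feature vectors, one per image region.
    attention: one attention score per region for a chosen caption word.
    strength: assumed cap in [0, 1); the most attended region keeps
              (1 - strength) of its features, so nothing is fully erased.
    """
    lo, hi = min(attention), max(attention)
    span = hi - lo
    # Min-max normalize to [0, 1]; uniform attention means no masking.
    norm = [(a - lo) / span if span > 0 else 0.0 for a in attention]
    return [
        [(1.0 - strength * w) * f for f in feat]
        for feat, w in zip(region_features, norm)
    ]

features = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
attention = [0.1, 0.3, 0.6]  # toy scores for one caption word
masked = soft_mask(features, attention, strength=0.5)
```

In this toy run the least attended region is left unchanged and the most attended region is scaled by half, so the masked image still carries some evidence of the word it was masked for.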

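This work proposes a visual-linguistic representation learning approach within a self-supervised learning framework, introducing a new operation, loss, and data augmentation strategy. The image regions most relevant to a given word in the corresponding caption are soft-masked to generate diverse image features, with the relevant regions identified by computing each word's conditional visual attention using a multi-modal encoder. The work also proposes a focal loss for the image-text contrastive learning (ITC) objective and performs multi-modal data augmentation for self-supervised learning.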
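
One way a focal loss could be applied to the ITC objective is sketched below. The abstract does not give the exact formulation, so this assumes the standard focal modulating factor (1 - p)^gamma applied to a symmetric softmax contrastive loss; the function name, `gamma`, and `temperature` values are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def focal_contrastive_loss(sim, gamma=2.0, temperature=0.07):
    """Symmetric image-text contrastive loss with a focal factor.

    sim[i][j] is the similarity between image i and text j; matched
    pairs lie on the diagonal. With gamma = 0 this reduces to the
    plain softmax contrastive loss; gamma > 0 down-weights pairs the
    model already matches confidently, emphasizing hard examples.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Probability of the matched text given image i.
        p_i2t = softmax([sim[i][j] / temperature for j in range(n)])[i]
        # Probability of the matched image given text i.
        p_t2i = softmax([sim[j][i] / temperature for j in range(n)])[i]
        for p in (p_i2t, p_t2i):
            total += -((1.0 - p) ** gamma) * math.log(p)
    return total / (2 * n)

sim = [[0.9, 0.1], [0.2, 0.8]]  # toy similarity matrix
plain = focal_contrastive_loss(sim, gamma=0.0, temperature=1.0)
focal = focal_contrastive_loss(sim, gamma=2.0, temperature=1.0)
```

Since the factor (1 - p)^gamma is below 1 for any matched pair with p > 0, the focal variant always yields a smaller loss than the plain one on the same batch, with the largest reduction on easy pairs.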