Text-to-audio (TTA) generation is a recent popular problem that aims to
synthesize general audio given text descriptions. Previous methods utilized
latent diffusion models to learn audio embedding in a latent space with text
embedding as the condition. However, they ignored the synchronization between
audio and visual content in the video, and tended to generate audio mismatching
from video frames. In this work, we propose a novel and personalized
text-to-sound generation approach with visual alignment based on latent
diffusion models, namely DiffAVA, that can simply fine-tune lightweight
visual-text alignment modules with frozen modality-specific encoders to update
visual-aligned text embeddings as the condition. Specifically, our DiffAVA
leverages a multi-head attention transformer to aggregate temporal information
from video features, and a dual multi-modal residual network to fuse temporal
visual representations with text embeddings. Then, a contrastive learning
objective is applied to match visual-aligned text embeddings with audio
features. Experimental results on the AudioCaps dataset demonstrate that the
proposed DiffAVA can achieve competitive performance on visual-aligned
text-to-audio generation.

该文章提出了一种基于视觉对齐的新型个性化文本转语音生成方法 ——DiffAVA，它使用多头注意力变换器聚合视觉特征的时间信息，并利用双模残差网络将时间视觉表示与文本嵌入进行融合，然后采用对比学习目标来匹配视觉对齐的文本嵌入和音频特征。研究结果表明，DiffAVA 在视觉对齐的文本转音频生成方面具有竞争力的表现。