Generating realistic audio for human interactions is important for many
applications, such as creating sound effects for films or virtual reality
games. Existing approaches implicitly assume total correspondence between the
video and audio during training, yet many sounds happen off-screen and have
weak to no correspondence with the visuals -- resulting in uncontrolled ambient
sounds or hallucinations at test time. We propose a novel ambient-aware audio
generation model, AV-LDM. We devise a novel audio-conditioning mechanism to
learn to disentangle foreground action sounds from the ambient background
sounds in in-the-wild training videos. Given a novel silent video, our model
uses retrieval-augmented generation to create audio that matches the visual
content both semantically and temporally. We train and evaluate our model on
two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS. Our model
outperforms an array of existing methods, allows controllable generation of the
ambient sound, and even shows promise for generalizing to computer graphics
game clips. Overall, our work is the first to focus video-to-audio generation
faithfully on the observed visual content despite training from uncurated clips
with natural background sounds.

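Proposes a novel ambient-aware audio generation model that produces audio matching the video content both semantically and temporally; it uses a dedicated audio-conditioning mechanism to learn to disentangle foreground action sounds from ambient background sounds in in-the-wild training videos.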

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
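
The abstract does not say how the retrieval step selects audio for a silent video, so the following is only a minimal sketch of one plausible form: nearest-neighbor lookup by cosine similarity between a query embedding and a pool of candidate embeddings. The function name, the embeddings, and the idea that the retrieved item serves as the audio condition are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def retrieve_audio_condition(query_embedding, pool_embeddings):
    """Return the index of the pool item most similar to the query.

    Hypothetical helper: `query_embedding` (shape [d]) stands in for an
    embedding of the silent video, and `pool_embeddings` (shape [n, d])
    for embeddings of candidate audio clips. Both are assumptions.
    """
    # Normalize so that the dot product equals cosine similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    p = pool_embeddings / np.linalg.norm(pool_embeddings, axis=1, keepdims=True)
    similarities = p @ q
    return int(np.argmax(similarities))

# Toy usage with made-up 2-D embeddings.
pool = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
best = retrieve_audio_condition(np.array([0.9, 1.0]), pool)
```

In a setup like this, the retrieved clip would be passed to the generator as the conditioning audio, which is one way the ambient sound could be controlled at test time.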

Video-to-audio (V2A) generation aims to synthesize content-matching audio
from silent video, and it remains challenging to build V2A models with high
generation quality, efficiency, and visual-audio temporal synchrony. We propose
Frieren, a V2A model based on rectified flow matching. Frieren regresses the
conditional transport vector field from noise to spectrogram latent with
straight paths and conducts sampling by solving an ODE, outperforming
autoregressive and score-based models in terms of audio quality. By employing a
non-autoregressive vector field estimator based on a feed-forward transformer
and channel-level cross-modal feature fusion with strong temporal alignment,
our model generates audio that is highly synchronized with the input video.
Furthermore, through reflow and one-step distillation with a guided vector
field, our model can generate decent audio in a few sampling steps, or even
only one.
Experiments indicate that Frieren achieves state-of-the-art performance in both
generation quality and temporal alignment on VGGSound, with alignment accuracy
reaching 97.22%, and a 6.2% improvement in inception score over the strong
diffusion-based baseline. Audio samples are available at
this http URL.

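Building video-to-audio (V2A) models with high quality, high efficiency, and audio-visual temporal synchrony remains challenging. Frieren is a V2A generation model based on rectified flow matching that synthesizes content-matching audio by regressing the conditional transport vector field from noise to the spectrogram latent. Using a non-autoregressive vector field estimator built on a feed-forward transformer, together with channel-level cross-modal feature fusion with strong temporal alignment, the model generates audio that is highly synchronized with the input video. Through reflow and one-step distillation with a guided vector field, it can produce decent audio in a few sampling steps, or even only one. Experiments show that Frieren achieves state-of-the-art generation quality and temporal alignment on VGGSound, with 97.22% alignment accuracy and a 6.2% improvement in inception score over a strong diffusion-based baseline.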
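
The rectified flow matching idea described above (regress a vector field along straight paths from noise to data, then sample by solving an ODE) can be illustrated with a small NumPy sketch. This is a generic illustration, not Frieren's implementation: the function names are hypothetical, video conditioning is omitted, and a closed-form "oracle" vector field for a single target stands in for the learned transformer estimator.

```python
import numpy as np

def rectified_flow_pair(x1, rng):
    """Build one training example for rectified flow matching.

    x1 is a data sample (e.g. a spectrogram latent). The regression
    target is the constant velocity of the straight path from noise x0
    to x1.
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint at t = 0
    t = rng.uniform()                    # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path
    target = x1 - x0                     # velocity the model should predict
    return xt, t, target

def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * vector_field(x, t)
    return x

# Toy check: with a single target x1, the ideal field is (x1 - x) / (1 - t),
# and the path is straight, so even one Euler step reaches x1.
rng = np.random.default_rng(0)
x1 = np.array([1.0, -2.0, 0.5])
oracle = lambda x, t: (x1 - x) / (1.0 - t)
x0 = rng.standard_normal(x1.shape)
one_step = euler_sample(oracle, x0, num_steps=1)
four_steps = euler_sample(oracle, x0, num_steps=4)
```

Straight paths are what make few-step sampling plausible: when the field is constant along each trajectory, a coarse Euler discretization incurs little error, which is the motivation the abstract gives for reflow and one-step distillation.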