Acoustic matching aims to re-synthesize an audio clip to sound as if it were
recorded in a target acoustic environment. Existing methods assume access to
paired training data, where the audio is observed in both source and target
environments, but this limits the diversity of training data or requires the
use of simulated data or heuristics to create paired samples. We propose a
self-supervised approach to visual acoustic matching where training samples
include only the target scene image and audio -- without acoustically
mismatched source audio for reference. Our approach jointly learns to
disentangle room acoustics and re-synthesize audio into the target environment,
via a conditional GAN framework and a novel metric that quantifies the level of
residual acoustic information in the de-biased audio. Training with either
in-the-wild web data or simulated data, we demonstrate that our approach
outperforms the state of the art on multiple challenging datasets and a wide
variety of real-world audio and environments.
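
To make concrete what "simulated data ... to create paired samples" can look like, here is a minimal sketch, not the authors' pipeline. It assumes the common model in which reverberant audio is the dry signal convolved with a room impulse response (RIR). The names `synthetic_rir`, `convolve`, and `make_paired_sample` are hypothetical, and the exponentially decaying noise RIR is a toy stand-in for a real acoustic simulator.

```python
import math
import random


def synthetic_rir(length, decay, seed=0):
    """Toy RIR (assumption): exponentially decaying noise with a unit direct-path tap."""
    rng = random.Random(seed)
    rir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    rir[0] = 1.0  # direct sound arrives first at full amplitude
    return rir


def convolve(signal, kernel):
    """Full linear convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def make_paired_sample(dry, length=64, decay=0.1, seed=0):
    """Return (source, target): the dry clip and the same clip with simulated reverberation."""
    rir = synthetic_rir(length, decay, seed)
    return dry, convolve(dry, rir)


# Usage: a unit impulse as the dry clip, so the target is just the RIR itself.
dry_clip = [1.0] + [0.0] * 7
source, target = make_paired_sample(dry_clip, length=4, decay=0.5)
```

The self-supervised setting described above removes the need for the `source` half of such a pair: only the target-environment audio and its image are observed during training.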

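Summary: Using a conditional GAN framework and a novel metric that measures the level of residual acoustic information in the de-biased audio, we propose a self-supervised visual acoustic matching method that learns to disentangle room acoustics and re-synthesize audio into the target environment without acoustically mismatched source audio as a reference. Whether trained on in-the-wild web data or simulated data, the method outperforms existing methods on multiple challenging datasets and a wide variety of real-world audio and environments.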

Self-Supervised Visual Acoustic Matching

We introduce the visual acoustic matching task, in which an audio clip is
transformed to sound like it was recorded in a target environment. Given an
image of the target environment and a waveform for the source audio, the goal
is to re-synthesize the audio to match the target room acoustics as suggested
by its visible geometry and materials. To address this novel task, we propose a
cross-modal transformer model that uses audio-visual attention to inject visual
properties into the audio and generate realistic audio output. In addition, we
devise a self-supervised training objective that can learn acoustic matching
from in-the-wild Web videos, despite their lack of acoustically mismatched
audio. We demonstrate that our approach successfully translates human speech to
a variety of real-world environments depicted in images, outperforming both
traditional acoustic matching and more heavily supervised baselines.
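
As a rough illustration of the audio-visual attention idea, the sketch below lets each audio frame attend over visual patch features and adds the attended visual context back to the audio feature. It is a simplified assumption, not the model described here: there are no learned projections, a single head, and the function name `cross_modal_attention` and the toy feature sizes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(audio_feats, visual_feats):
    """Audio frames act as queries; visual patches act as keys and values.

    Returns (fused, weights): fused[t] is the audio feature plus the attended
    visual context (a residual connection), weights[t] is the attention
    distribution of frame t over the visual patches.
    """
    dim = len(visual_feats[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    fused, weights = [], []
    for query in audio_feats:
        w = softmax([dot(query, key) * scale for key in visual_feats])
        context = [sum(wi * v[c] for wi, v in zip(w, visual_feats)) for c in range(dim)]
        fused.append([a + b for a, b in zip(query, context)])
        weights.append(w)
    return fused, weights


# Usage: three audio frames and two visual patches, all 2-dimensional toy features.
audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
visual = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_modal_attention(audio, visual)
```

In a full model the queries, keys, and values would come from learned projections of encoder outputs, and a decoder would turn the fused features back into a waveform.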

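Summary: This work introduces the visual acoustic matching task and proposes a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. A self-supervised training objective learns acoustic matching from in-the-wild Web videos, allowing human speech to be translated to a variety of real-world environments. Experiments show that the approach outperforms both traditional acoustic matching and more heavily supervised baselines.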