We introduce the visual acoustic matching task, in which an audio clip is
transformed to sound as if it were recorded in a target environment. Given an
image of the target environment and a waveform for the source audio, the goal
is to re-synthesize the audio to match the target room acoustics as suggested
by its visible geometry and materials. To address this novel task, we propose a
cross-modal transformer model that uses audio-visual attention to inject visual
properties into the audio and generate realistic audio output. In addition, we
devise a self-supervised training objective that can learn acoustic matching
from in-the-wild Web videos, despite their lack of acoustically mismatched
audio. We demonstrate that our approach successfully translates human speech to
a variety of real-world environments depicted in images, outperforming both
traditional acoustic matching and more heavily supervised baselines.
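
As a rough illustration of the audio-visual attention idea described above, the following minimal sketch shows a single cross-attention step in which audio frames act as queries and visual features act as keys and values. It is an assumption-laden toy, not the model from this work: the function name `cross_modal_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all feature dimensions are hypothetical, and the random arrays merely stand in for learned audio and image embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(audio, visual, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention.

    audio:  (T, d_a) audio frame features (queries)
    visual: (P, d_v) visual patch features (keys and values)
    w_q: (d_a, d), w_k: (d_v, d), w_v: (d_v, d_a)
    Returns the fused audio features (T, d_a) and the attention weights (T, P).
    """
    q = audio @ w_q                           # (T, d)
    k = visual @ w_k                          # (P, d)
    v = visual @ w_v                          # (P, d_a)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, P) scaled dot products
    weights = softmax(scores, axis=-1)        # each audio frame attends over patches
    fused = audio + weights @ v               # residual injection of visual information
    return fused, weights


# Toy inputs: 5 audio frames (dim 8) and 3 visual patches (dim 16); values are random.
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 8))
visual = rng.standard_normal((3, 16))
w_q = rng.standard_normal((8, 4))
w_k = rng.standard_normal((16, 4))
w_v = rng.standard_normal((16, 8))

fused, weights = cross_modal_attention(audio, visual, w_q, w_k, w_v)
```

Here each row of `weights` indicates how strongly one audio frame draws on each visual patch, and `fused` keeps the audio feature shape so that it could be passed to a later synthesis stage.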

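Summary: This work proposes the visual acoustic matching task together with a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. A self-supervised training objective learns acoustic matching from in-the-wild Web videos, so that human speech can be translated to a variety of real-world environments. Experiments show that the approach outperforms both traditional acoustic matching and more heavily supervised baselines.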