Humans can easily perceive the direction of sound sources in a visual scene, an
ability termed sound source localization. Recent learning-based studies have
mainly explored the problem from a localization perspective. However, prior work
and existing benchmarks do not account for a more important
aspect of the problem, cross-modal semantic understanding, which is essential
for genuine sound source localization. Cross-modal semantic understanding is
important in understanding semantically mismatched audio-visual events, e.g.,
silent objects, or off-screen sounds. To account for this, we propose a
cross-modal alignment task as a joint task with sound source localization to
better learn the interaction between audio and visual modalities. Thereby, we
achieve high localization performance with strong cross-modal semantic
understanding. Our method outperforms the state-of-the-art approaches in both
sound source localization and cross-modal retrieval. Our work suggests that
jointly tackling both tasks is necessary to achieve genuine sound source
localization.

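We propose a cross-modal alignment task as a joint task with sound source localization to better learn the interaction between audio and visual modalities. The method outperforms existing approaches in both sound source localization and cross-modal retrieval, achieving high localization performance with strong cross-modal semantic understanding.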

Sound Source Localization is All about Cross-Modal Alignment

While Multimodal Large Language Models (MM-LLMs) have recently made exciting
strides, they mostly fall prey to the limitation of only input-side multimodal
understanding, without the ability to produce content in multiple modalities.
As we humans always perceive the world and communicate with people through
various modalities, developing any-to-any MM-LLMs capable of accepting and
delivering content in any modality becomes essential to human-level AI. To fill
the gap, we present an end-to-end general-purpose any-to-any MM-LLM system,
NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion
decoders, enabling NExT-GPT to perceive inputs and generate outputs in
arbitrary combinations of text, images, videos, and audio. By leveraging the
existing well-trained, high-performing encoders and decoders, NExT-GPT is
tuned with only a small fraction of parameters (1%) in certain projection layers,
which not only enables low-cost training but also facilitates convenient
expansion to more potential modalities. Moreover, we introduce
modality-switching instruction tuning (MosIT) and manually curate a
high-quality dataset for MosIT, based on which NExT-GPT is empowered with
complex cross-modal semantic understanding and content generation. Overall, our
research showcases the promising possibility of building an AI agent capable of
modeling universal modalities, paving the way for more human-like AI research
in the community.

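We present NExT-GPT, an end-to-end, general-purpose any-to-any MM-LLM system. By connecting an LLM with multimodal adaptors and different diffusion decoders, NExT-GPT can accept and generate content in arbitrary combinations of text, images, videos, and audio. It is trained and extended by tuning only a small number of parameters in the projection layers, which gives it complex cross-modal semantic understanding and content generation capabilities and shows a promising path toward AI agents that can model universal modalities.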