How to make audio and vision interact effectively has garnered considerable
interest in multi-modal research. Recently, a novel
audio-visual segmentation (AVS) task has been proposed, aiming to segment the
sounding objects in video frames under the guidance of audio cues. However,
most existing AVS methods are hindered by a modality imbalance where the visual
features tend to dominate those of the audio modality, due to a unidirectional
and insufficient integration of audio cues. This imbalance skews the feature
representation towards the visual aspect, impeding the learning of joint
audio-visual representations and potentially causing segmentation inaccuracies.
To address this issue, we propose AVSAC. Our approach features a Bidirectional
Audio-Visual Decoder (BAVD) with integrated bidirectional bridges, enhancing
audio cues and fostering continuous interplay between audio and visual
modalities. This bidirectional interaction reduces the modality imbalance,
facilitating more effective learning of integrated audio-visual
representations. Additionally, we present an audio-visual frame-wise
synchrony strategy that provides fine-grained guidance for BAVD. This strategy
increases the share of auditory components in visual features, contributing to
more balanced audio-visual representation learning. Extensive experiments show
that our method achieves new state-of-the-art AVS performance.
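
The contrast between unidirectional and bidirectional fusion can be made concrete with a toy example. The sketch below is not the AVSAC or BAVD architecture; it is a minimal illustration, assuming plain scaled dot-product cross-attention without learned projections, in which visual tokens attend to audio tokens and audio tokens attend to visual tokens in the same step. The names `cross_attend` and `bidirectional_step` are hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys_values):
    # Each query token attends to all tokens of the other modality and
    # adds the attended context back to itself (residual connection).
    scale = math.sqrt(len(keys_values[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        context = [
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(len(q))
        ]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def bidirectional_step(audio, visual):
    # Assumption: both directions read the pre-update features.
    new_visual = cross_attend(visual, audio)  # audio cues flow into vision
    new_audio = cross_attend(audio, visual)   # visual cues flow back into audio
    return new_audio, new_visual


# Toy features: one audio token and three visual tokens of dimension 2.
audio = [[1.0, 0.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
new_audio, new_visual = bidirectional_step(audio, visual)
```

A unidirectional design would compute only `new_visual` and leave `audio` unchanged. Here the audio features are refreshed as well, so they could keep contributing if such steps were stacked.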

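AVSAC builds a bidirectional audio-visual decoder with bidirectional bridges that strengthens audio cues and sustains continuous interaction between the audio and visual modalities, reducing modality imbalance and promoting effective learning of integrated audio-visual representations. It also introduces an audio-visual frame-wise synchrony strategy that better synchronizes audio components with visual features, contributing to more balanced audio-visual representation learning. Extensive experiments show that the method achieves new advances in AVS performance.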

Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Audio-visual representation learning aims to develop systems with human-like
perception by exploiting the correlation between auditory and visual information.
However, current models often focus on a limited set of tasks, and the
generalization abilities of the learned representations are unclear. To this end,
we propose the AV-SUPERB benchmark that enables general-purpose evaluation of
unimodal audio/visual and bimodal fusion representations on 7 datasets covering
5 audio-visual tasks in speech and audio processing. We evaluate 5 recent
self-supervised models and show that none of these models generalizes to all
tasks, emphasizing the need for future study on improving universal model
performance. In addition, we show that representations may be improved with
intermediate-task fine-tuning, and that audio event classification with AudioSet
serves as a strong intermediate task. We release our benchmark with evaluation
code and a model submission platform to encourage further research in
audio-visual learning.

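Audio-visual representation learning seeks to build systems with human-like perception by exploiting the correlation between sound and visual information. Current models, however, tend to focus on a limited set of tasks, and the generalization ability of the learned representations remains unclear. We therefore propose the AV-SUPERB benchmark, which enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of them generalizes to all tasks, underscoring the need for further research on improving universal model performance. We also show that representations can be improved through intermediate-task fine-tuning, using audio event classification with AudioSet as the intermediate task. We release the benchmark together with evaluation code and a model submission platform to encourage further research in audio-visual learning.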

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Audio-visual representation learning is an important task from the
perspective of designing machines with the ability to understand complex
events. To this end, we propose a novel multimodal framework that instantiates
multiple instance learning. We show that the learnt representations are useful
for classifying events and localizing their characteristic audio-visual
elements. The system is trained using only video-level event labels without any
timing information. An important feature of our method is its capacity to learn
from unsynchronized audio-visual events. We achieve state-of-the-art results on
a large-scale dataset of weakly-labeled audio event videos. Visualizations of
localized visual regions and audio segments substantiate our system's efficacy,
especially when dealing with noisy situations where modality-specific cues
appear asynchronously.
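
To illustrate how video-level labels alone can drive both classification and localization, the sketch below shows one common form of multiple instance learning: max pooling over instance scores. It is not the authors' system; the function names, the toy scores, and the choice to fuse modalities by summing their pooled scores are all assumptions made for illustration.

```python
def mil_video_score(instance_scores):
    # Max-pool instance scores into one bag-level score and return the
    # index of the winning instance, which serves as the localization.
    best_index = max(range(len(instance_scores)), key=lambda i: instance_scores[i])
    return instance_scores[best_index], best_index


def late_fusion(audio_scores, visual_scores):
    # Each modality is pooled independently, so the strongest audio
    # segment and the strongest visual region need not be aligned in time.
    audio_score, audio_idx = mil_video_score(audio_scores)
    visual_score, visual_idx = mil_video_score(visual_scores)
    return audio_score + visual_score, audio_idx, visual_idx


# Hypothetical per-instance scores for one event class in one video.
audio_scores = [0.1, 0.2, 0.9, 0.3]   # one score per audio segment
visual_scores = [0.8, 0.1, 0.2]       # one score per visual region proposal
video_score, audio_idx, visual_idx = late_fusion(audio_scores, visual_scores)
```

In this toy case the audio evidence peaks at segment 2 while the visual evidence peaks at region 0. Because pooling is done per modality, the video-level score can still combine both cues, which is the property that makes learning from unsynchronized events possible in this kind of setup.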

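This paper proposes a novel framework based on multimodal learning that can learn from unsynchronized audio and visual events and is used for event classification and localization. The method achieves state-of-the-art results on a large-scale dataset of weakly labeled audio event videos.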