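TL;DR: This paper proposes AV-SAM, a simple yet effective audio-visual localization and segmentation framework based on the SAM model. It generates masks of the sounding objects that correspond to the audio, enabling audio-visual tasks such as sound localization and segmentation.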
Abstract
The Segment Anything Model (SAM) has recently shown its powerful effectiveness in
visual segmentation tasks. However, little work has explored how
SAM performs on audio-visual tasks, such as visual sound localization and
segmentation. In this work, we propose a simple yet effectiv