Audio-visual segmentation (AVS) aims to identify, at the pixel level, the
object in a visual scene that produces a given sound. Current AVS methods rely
on costly, fine-grained annotations of mask-audio pairs, which limits their
scalability. To address this, we introduce unsupe