Perceiving a scene most fully requires all the senses. Yet modeling how
objects look and sound is challenging: most natural scenes and events contain
multiple objects, and the audio track mixes all the sound sources together. We
propose to learn audio-visual object models from unlabele