Multi-modality perception is essential for developing interactive intelligence.
In this work, we consider a new task of visual information-infused audio
inpainting, \ie synthesizing missing audio segments that correspond to their
accompanying videos. We identify two key aspects for a succe