How does audio describe the world around us? In this paper, we propose a
method for generating an image of a scene from sound. Our method addresses the
large gaps that often exist between sight and
sound. We design a model that works by scheduling the lea