Generating combined visual and auditory sensory experiences is critical for
the consumption of immersive content. Recent advances in neural generative
models have enabled the creation of high-resolution content across multiple
modalities such as images, text, speech, and videos. Despite these successes,
there remains a significant gap in the generation of hi