Intelligently reasoning about the world often requires integrating data from
multiple modalities, as any individual modality may contain unreliable or
incomplete information. Prior work in multimodal learning fuses input
modalities only after significant independent processing. On the