deep multimodal learning has achieved great progress in recent years.
However, current fusion approaches are static in nature, i.e., they process and
fuse multimodal inputs with identical computation, without accounting for
diverse computational demands of different multimodal data. In