Learning multi-modal representations is an essential step towards real-world
robotic applications, and various multi-modal fusion models have been developed
for this purpose. However, we observe that existing models, whose objectives
are mostly based on joint training, often suffer fro