A real-world application or setting involves interaction between different modalities (e.g., video, speech, text). In order to process the multimodal information automatically and use it for an end application, multimodal representation learning (MRL) has emerged as an active area of r