Human beings express emotion in rich ways, including facial actions,
voice, and natural language. Because individuals are diverse and complex,
the emotions expressed through different modalities may be semantically
irrelevant to one another. Directly fusing information from different modalities
may expose the model to noise from se