Studies on emotion recognition (ER) show that combining lexical and acoustic
information results in more robust and accurate models. Most of these
studies focus on settings where both modalities are available during training and
evaluation. However, in practice, this is not always th