multimodal language analysis is a demanding area of research, since it is
associated with two requirements: combining different modalities and capturing
temporal information. During the last years, several works have been proposed
in the area, mostly centered around supervised learning