In the last decade, video blogs (vlogs) have become an extremely popular
medium through which people express sentiment. The ubiquity of these
videos has increased the importance of multimodal fusion models, which
incorporate video and audio features with traditional text features for
automatic sentiment detection. Multimodal fusion offers a unique opportunity to
build models that learn from the full depth of expression available to human
viewers. When detecting sentiment in these videos, acoustic and video
features can clarify otherwise ambiguous transcripts. In this paper, we
present a multimodal fusion model that exclusively uses high-level video and
audio features to analyze spoken sentences for sentiment. We discard
traditional transcription features in order to minimize human intervention and
to maximize the deployability of our model on real-world data at scale. We
select high-level features for our model that have been successful in non-affect
domains in order to test their generalizability in the sentiment detection
domain. We train and test our model on the newly released CMU Multimodal
Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, obtaining an F1
score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out
challenge test set.

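This paper presents a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. The model was trained and tested on the CMU-MOSEI dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the challenge test set.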