Videos convey rich information. Dynamic spatio-temporal relationships between people/objects, and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks which can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, and hence give the model useful extra information (in explicit textual format to allow easier matching) for answering questions. Moreover, our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-andOut Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word, object, and frame level visualization studies. Our code is publicly available at: https://github.com/hyounghk/VideoQADenseCapFrameGate-ACL2020

本文提出了一种视频问答模型，它有效地集成了多模态输入源并从中找到临时相关信息以回答问题，该模型包括多种设计方法，包括基于稠密图像标题的对象及其详细显著区域和动作识别，双重关注，跨集成和引入了带有人类重要性注释来更好地监督模型的两个损失函数的门控，这个模型在多个数据集上的表现优于现有的技术。

视频问答中基于密集字幕匹配和帧选择门控的时间定位