To understand movies, humans constantly reason over the dialogs and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogs, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognitively inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state of the art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
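The fusion step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each of the three branches has already produced one score per candidate answer and that the modality weights come from a softmax over one scalar per branch. The abstract does not specify the mechanism, so the function names, the branch names (`dialog`, `scene`, `storyline`), and the softmax choice are all hypothetical and not taken from ROLL itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(branch_scores, branch_logits):
    """Weighted sum of per-branch answer scores (illustrative assumption).

    branch_scores: dict mapping branch name -> list of scores, one per
                   candidate answer (all lists have the same length).
    branch_logits: dict mapping branch name -> scalar; a softmax over
                   these gives the modality weights. In a trained model
                   these would be learned; here they are plain inputs.
    Returns (fused_scores, index_of_best_candidate).
    """
    names = sorted(branch_scores)
    weights = softmax([branch_logits[name] for name in names])
    num_candidates = len(branch_scores[names[0]])
    fused = [0.0] * num_candidates
    for name, weight in zip(names, weights):
        for i, score in enumerate(branch_scores[name]):
            fused[i] += weight * score
    best = max(range(num_candidates), key=lambda i: fused[i])
    return fused, best
```

With equal logits every branch gets the same weight, so the fused score is the plain average of the branch scores. Raising one branch's logit shifts the decision toward that source, which is the balancing behaviour the abstract attributes to modality weighting.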

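By processing scene dialogs, generating video scene descriptions, and obtaining external knowledge through weak supervision, the ROLL model encodes this information with Transformers and uses a modality weighting mechanism to balance the information from the different sources. Evaluated on knowledge-based video story question answering, ROLL performs strongly on two challenging datasets, KnowIT VQA and TVQA+, and is a promising approach.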

Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions