TVQA: Localized, Compositional Video Question Answering

Sep 2018

Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg

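TL;DR: This paper presents TVQA, a large-scale video question-answering dataset based on 6 popular TV shows. It contains 152,545 QA pairs drawn from 21,793 clips, covering 460 hours of video. The questions are compositional: to answer them, a system must jointly localize the relevant moments within a clip, comprehend the subtitle-based dialogue, and recognize the relevant visual concepts. The authors provide an analysis of the dataset, several baseline models, and a multi-stream, end-to-end trainable neural network framework.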
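To give a sense of scale, the figures quoted in the TL;DR can be turned into per-clip averages. The short Python sketch below does only that arithmetic; the function and variable names are illustrative, and it assumes the 460-hour total is exact, so the resulting clip length is an approximation rather than a number reported by the authors.

```python
# Figures quoted in the TL;DR above.
NUM_QA_PAIRS = 152_545
NUM_CLIPS = 21_793
TOTAL_HOURS = 460  # assumption: treated as an exact total


def dataset_averages(num_qa_pairs, num_clips, total_hours):
    """Return (QA pairs per clip, seconds of video per clip)."""
    qa_per_clip = num_qa_pairs / num_clips
    seconds_per_clip = total_hours * 3600 / num_clips
    return qa_per_clip, seconds_per_clip


qa_per_clip, seconds_per_clip = dataset_averages(
    NUM_QA_PAIRS, NUM_CLIPS, TOTAL_HOURS
)
print(f"{qa_per_clip:.2f} QA pairs per clip")       # 7.00 QA pairs per clip
print(f"{seconds_per_clip:.1f} seconds per clip")   # 76.0 seconds per clip
```

Under these assumptions, each clip carries about 7 questions and runs a little over a minute on average.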

Abstract

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA,