We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is an image rather than a natural language sentence. We propose a novel dataset for the VE task, SNLI-VE, built from the Stanford Natural Language Inference (SNLI) corpus and Flickr30K. To tackle the VE problem, we introduce a differentiable architecture called the Explainable Visual Entailment model (EVE). We evaluate EVE and several other state-of-the-art visual question answering (VQA) based models on SNLI-VE, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.

Visual Entailment Task for Visually-Grounded Language Learning