Visual Question Answering (VQA) is the task of taking as input an image and a
free-form natural language question about the image, and producing an accurate
answer. In this work we view VQA as a "feature extraction" module to extract
image and caption representations. We employ these representations for the task
of image-caption ranking. Each feature dimension captures (imagines) whether a
fact (question-answer pair) could plausibly be true for the image and caption.
This allows the model to interpret images and captions from a wide variety of
perspectives. We propose score-level and representation-level fusion models to
incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic
image-caption ranking model. We find that incorporating and reasoning about
consistency between images and captions significantly improves performance.
Concretely, our model improves on the state of the art by 7.1% on caption
retrieval and by 4.4% on image retrieval on the MSCOCO dataset.
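
The two fusion strategies named above can be illustrated with a minimal sketch.
It is not the model from this work: the function names, the cosine scoring, the
fixed weight `alpha`, and the plain concatenation are all assumptions chosen for
illustration, and a trained model would likely learn these combinations rather
than fix them by hand.

```python
import math


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def score_level_fusion(base_score, img_vqa, cap_vqa, alpha=0.5):
    """Hypothetical score-level fusion.

    Blends a VQA-agnostic image-caption score with a consistency score
    computed from VQA-based feature vectors. `alpha` is an assumed
    hand-set mixing weight, not a value taken from the source.
    """
    return (1 - alpha) * base_score + alpha * cosine(img_vqa, cap_vqa)


def representation_level_fusion(img_base, img_vqa, cap_base, cap_vqa):
    """Hypothetical representation-level fusion.

    Concatenates the VQA-agnostic and VQA-based representations on each
    side, then scores the fused image vector against the fused caption
    vector. Plain concatenation is an assumed simplification.
    """
    return cosine(img_base + img_vqa, cap_base + cap_vqa)


def rank_candidates(scores):
    """Return candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this sketch, each entry of `img_vqa` or `cap_vqa` stands for how plausible one
question-answer pair is for the image or the caption, so the two vectors agree
when the image and caption support the same facts.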

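Summary: This study treats the visual question answering task as a "feature
extraction" module that extracts image and caption representations, uses them
to rank image-caption pairs, and proposes fusion models that improve performance
by exploiting image-caption consistency. Experiments show that on the MSCOCO
dataset the model improves caption retrieval by 7.1% and image retrieval by
4.4%.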