Visual Question Answering (VQA) has attracted attention from both computer
vision and natural language processing communities. Most existing approaches
adopt the pipeline of representing an image via pre-trained CNNs, and then
using the uninterpretable CNN features in conjunction with the question to
predict the answer. Although such end-to-end models might report promising
performance, they rarely provide any insight, apart from the answer, into the
VQA process. In this work, we propose to break up the end-to-end VQA into two
steps: explaining and reasoning, in an attempt towards a more explainable VQA
by shedding light on the intermediate results between these two steps. To that
end, we first extract attributes and generate descriptions as explanations for
an image using pre-trained attribute detectors and image captioning models,
respectively. Next, a reasoning module utilizes these explanations in place of
the image to infer an answer to the question. The advantages of such a
breakdown include: (1) the attributes and captions can reflect what the system
extracts from the image, and thus can provide some explanation for the predicted
answer; (2) these intermediate results can help us identify the shortcomings of
both the image understanding part and the answer inference part when the
predicted answer is wrong. We conduct extensive experiments on a popular VQA
dataset and dissect all results according to several measurements of the
explanation quality. Our system achieves performance comparable to the
state of the art, with the added benefits of explainability and the inherent
ability to improve further with higher-quality explanations.
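
The two-step breakdown described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Explanation`, `explain`, and `reason` are hypothetical, and the toy functions are placeholders standing in for the pre-trained attribute detector, the captioning model, and the learned reasoning module.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Explanation:
    """Human-readable intermediate result produced by the explaining step."""
    attributes: List[str]
    caption: str


def explain(image, detect_attributes: Callable, generate_caption: Callable) -> Explanation:
    """Step 1 (explaining): describe the image with attributes and a caption."""
    return Explanation(detect_attributes(image), generate_caption(image))


def reason(question: str, explanation: Explanation, answer_fn: Callable) -> str:
    """Step 2 (reasoning): answer from the explanation only, never from the image."""
    return answer_fn(question, explanation.attributes, explanation.caption)


# Hypothetical toy stand-ins; real systems would use trained models here.
def toy_attributes(image):
    return ["dog", "frisbee", "grass"]


def toy_caption(image):
    return "a dog catching a frisbee on the grass"


def toy_answer(question, attributes, caption):
    # Naive rule: say "yes" if the last word of the question is in the evidence.
    target = question.lower().rstrip("?").split()[-1]
    evidence = set(attributes) | set(caption.split())
    return "yes" if target in evidence else "no"


explanation = explain(None, toy_attributes, toy_caption)
answer = reason("Is there a frisbee?", explanation, toy_answer)
print(explanation)  # the intermediate result can be inspected on its own
print(answer)
```

Because the reasoning step receives only the `Explanation`, a wrong answer can be traced either to a poor explanation or to faulty inference over a good one.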

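This work proposes breaking end-to-end VQA into two steps, explaining and reasoning: pre-trained attribute detectors and image captioning models extract image attributes and generate image descriptions, and a reasoning module then uses these explanations in place of the image to infer the answer to the question. Through experiments on a popular VQA dataset, we show that the system offers explainability and the inherent ability to improve further with higher-quality explanations.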

Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

Attributes possess appealing properties and benefit many computer vision
problems, such as object recognition, learning with humans in the loop, and
image retrieval. Whereas existing work mainly focuses on applying attributes
to various computer vision problems, we contend that the most basic
problem---how to accurately and robustly detect attributes from images---has
been left underexplored. In particular, existing work rarely addresses
the need for attribute detectors to generalize well across
different categories, including those previously unseen. Noting that this is
analogous to the objective of multi-source domain generalization, if we treat
each category as a domain, we provide a novel perspective on attribute
detection and propose to adapt techniques from multi-source domain
generalization to learn attribute detectors that generalize across
categories. We validate our understanding and approach with extensive
experiments on four challenging datasets and three different problems.
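
The framing of each category as a domain can be illustrated with a small leave-one-category-out split. This sketch shows only the data setup implied by that framing, not the detection method itself, which the abstract does not specify; the function names and the toy samples (a hypothetical "striped" attribute) are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_category(samples):
    """Treat each object category as one domain of (features, attribute_label) pairs."""
    domains = defaultdict(list)
    for features, attribute_label, category in samples:
        domains[category].append((features, attribute_label))
    return dict(domains)


def leave_one_category_out(domains):
    """Yield (source_domains, held_out_category, target_data) for each category.

    The held-out category plays the role of an unseen target domain.
    """
    for held_out in sorted(domains):
        sources = {c: data for c, data in domains.items() if c != held_out}
        yield sources, held_out, domains[held_out]


# Toy samples: (features, label for a hypothetical "striped" attribute, category).
samples = [
    ([0.9, 0.1], 1, "zebra"),
    ([0.2, 0.8], 0, "horse"),
    ([0.7, 0.3], 1, "tiger"),
    ([0.1, 0.9], 0, "zebra"),
]

domains = group_by_category(samples)
splits = list(leave_one_category_out(domains))
for sources, held_out, target in splits:
    print(f"train on {sorted(sources)}, test on unseen category {held_out!r} "
          f"({len(target)} samples)")
```

An attribute detector trained on the source categories and evaluated on the held-out one is then measured on exactly the cross-category generalization the abstract calls for.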

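This paper examines how to detect attributes from images accurately and robustly and, drawing on methods from multi-source domain generalization, offers a new perspective on learning attribute detectors that generalize across categories. Extensive experiments on four challenging datasets and three different problems validate the effectiveness of the approach.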