Providing explanations for visual question answering (VQA) has attracted considerable research attention. However, most existing systems use separate models for predicting answers and for generating explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this, we propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations (UMAE). Specifically, we add artificial prompt tokens to training instances and finetune a multimodal encoder-decoder model on various VQA tasks. In our experiments, UMAE models surpass the prior state-of-the-art (SOTA) answer accuracy on A-OKVQA by 10-15%, show competitive results on OK-VQA, achieve new SOTA explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
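
The prompt-token idea can be illustrated with a minimal sketch of how one annotated instance might be turned into several text-to-text training pairs, one per task. Everything specific here is an assumption made for illustration: the token strings (`<ANS>`, `<EXP>`, `<ANS_EXP>`), the `VQAInstance` fields, the `build_example` helper, and the "because" template joining answer and explanation are hypothetical and are not taken from the paper. The image input is omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical task tokens; the actual token strings are not given in the abstract.
TASK_TOKENS = {
    "answer": "<ANS>",              # predict the answer only
    "explain": "<EXP>",             # explain a given answer
    "answer_explain": "<ANS_EXP>",  # predict answer and explanation jointly
}


@dataclass
class VQAInstance:
    question: str
    answer: str
    explanation: Optional[str] = None  # not every dataset has explanations


def build_example(inst: VQAInstance, task: str) -> Tuple[str, str]:
    """Return a (source_text, target_text) pair for one task (assumed format)."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    if task != "answer" and inst.explanation is None:
        raise ValueError(f"task '{task}' needs an explanation")

    token = TASK_TOKENS[task]
    if task == "answer":
        return f"{token} {inst.question}", inst.answer
    if task == "explain":
        return f"{token} {inst.question} answer: {inst.answer}", inst.explanation
    # Joint task: the decoder produces the answer followed by the explanation.
    return f"{token} {inst.question}", f"{inst.answer} because {inst.explanation}"


inst = VQAInstance(
    question="What is the man holding?",
    answer="umbrella",
    explanation="it is raining",
)
pairs = [build_example(inst, task) for task in TASK_TOKENS]
```

Because the task token is the only thing that distinguishes the three sources, a single encoder-decoder model can be finetuned on the mixed pairs and steered at inference time by choosing the token.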

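This paper proposes UMAE, a unified model based on multitask learning, to address the separation of answer prediction and explanation generation in existing visual question answering systems. The method adds artificial prompt tokens to the training data and finetunes the model on a variety of VQA-related tasks. Experiments show clear improvements in answer accuracy, explanation quality, and out-of-domain performance.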

Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering