Many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than associating language with visual concepts. In this work, we propose a generic framework, which we call Human Importance-aware Network Tuning (HINT), that effectively leverages human supervision to improve visual grounding. HINT constrains deep networks to be sensitive to the same input regions as humans. Crucially, our approach optimizes the alignment between human attention maps and gradient-based network importances, ensuring that models learn not just to look at, but to rely on, the visual concepts that humans found relevant for a task when making predictions. We demonstrate our approach on Visual Question Answering and Image Captioning tasks, achieving state-of-the-art results on the VQA-CP dataset, which penalizes over-reliance on language priors.
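To make the idea of aligning human attention with gradient-based importances concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a toy scorer (`score`), a per-region importance defined as the summed gradient of the score with respect to that region's features (estimated here by finite differences), and a pairwise ranking hinge as the alignment penalty. All function names, the toy model, and the choice of penalty are illustrative assumptions.

```python
import math

def score(regions, w):
    """Hypothetical toy scorer: sum of tanh(w . v) over region feature vectors."""
    return sum(math.tanh(sum(wi * vi for wi, vi in zip(w, v))) for v in regions)

def region_importances(regions, w, eps=1e-5):
    """Gradient-based importance per region (assumed definition): the sum over
    feature dimensions of d(score)/d(feature), via central finite differences."""
    importances = []
    for r, v in enumerate(regions):
        total = 0.0
        for d in range(len(v)):
            plus = [list(x) for x in regions]
            minus = [list(x) for x in regions]
            plus[r][d] += eps
            minus[r][d] -= eps
            total += (score(plus, w) - score(minus, w)) / (2 * eps)
        importances.append(total)
    return importances

def ranking_misalignment(human, network):
    """Pairwise hinge (illustrative choice): whenever humans rank region i above
    region j, penalize the amount by which the network ranks j above i."""
    loss = 0.0
    for i in range(len(human)):
        for j in range(len(human)):
            if human[i] > human[j]:
                loss += max(0.0, network[j] - network[i])
    return loss

# Made-up example: three regions with two features each.
regions = [[0.1, 0.2], [2.0, 1.5], [0.5, -0.3]]
w = [1.0, 0.5]
human_attention = [0.1, 0.7, 0.2]  # humans consider region 1 most relevant

network_importance = region_importances(regions, w)
loss = ranking_misalignment(human_attention, network_importance)
print(network_importance, loss)
```

In a real model the gradients would come from automatic differentiation rather than finite differences, and a penalty of this kind would be added to the task loss so that training pushes the network to rely on the regions humans marked as relevant.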

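This paper proposes a general method called HINT that improves visual grounding by effectively leveraging human demonstrations to optimize deep networks' sensitivity to visual concepts. Applied to visual question answering and image captioning, and using human attention demonstrations for only 6% of the training data, it outperforms leading methods on VQA-CP and on robust captioning.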

Leveraging Explanations to Make Vision and Language Models More Grounded: The HINT Method