Natural language explanation in visual question answer (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in the black-box systems. Existing post-hoc methods have achieved significant progress in obtaining a plausible explanation. However, such post-hoc explanations are not always aligned with human logical inference, suffering from the issues on: 1) Deductive unsatisfiability, the generated explanations do not logically lead to the answer; 2) Factual inconsistency, the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) Semantic perturbation insensitivity, the model can not recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of explanations generated by models. To address the above issues, we propose a novel self-supervised \textbf{M}ulti-level \textbf{C}ontrastive \textbf{L}earning based natural language \textbf{E}xplanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces from explanations with visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks.

为了解决VQA-NLE模型在逻辑推理、事实一致性和语义扰动不敏感性等方面存在的问题，我们提出了一种基于自监督多层对比学习的自然语言解释模型(MCLE)，通过提取具有辨别性的特征并将解释的特征空间与视觉问题和答案对齐，生成更一致的解释。我们通过广泛的实验、消融分析和案例研究来证明我们方法的有效性。

运用多层对比学习在VQA中实现更加忠实的自然语言解释