Natural language explanation in visual question answering (VQA-NLE) aims to
explain the decision-making process of models by generating natural language
sentences to increase users' trust in the black-box systems. Existing post-hoc
methods have achieved significant progress in obtaining a plausible
explanation. However, such post-hoc explanations are not always aligned with
human logical inference and suffer from three issues: 1) Deductive
unsatisfiability: the generated explanations do not logically lead to the
answer; 2) Factual inconsistency: the model falsifies its counterfactual
explanation for answers without considering the facts in images; and 3)
Semantic perturbation insensitivity: the model cannot recognize the semantic
changes caused by small perturbations. These problems reduce the faithfulness
of explanations generated by models. To address the above issues, we propose a
novel self-supervised \textbf{M}ulti-level \textbf{C}ontrastive
\textbf{L}earning based natural language \textbf{E}xplanation model (MCLE) for
VQA with semantic-level, image-level, and instance-level factual and
counterfactual samples. MCLE extracts discriminative features and aligns the
feature space of explanations with that of the visual question and answer to
generate more consistent explanations. We conduct extensive experiments,
ablation analysis, and a case study to demonstrate the effectiveness of our
method on two VQA-NLE benchmarks.
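
The abstract does not state the contrastive objective MCLE uses. As a minimal sketch only, assuming an InfoNCE-style loss (a common choice for contrastive learning, not confirmed by the source), the idea of pulling an explanation feature toward a factual sample and pushing it away from counterfactual samples could look like the following. The function name `info_nce`, the `temperature` value, and the toy vectors are all hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor, one positive, and K negatives.

    anchor, positive: 1-D feature vectors of equal length.
    negatives: 2-D array of shape (K, dim).
    All names and the temperature default are illustrative assumptions.
    """
    def unit(v):
        # L2-normalize along the last axis so dot products are cosine similarities.
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    # Similarity of the anchor to the positive (slot 0) and to each negative.
    logits = np.concatenate(([a @ p], n @ a)) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    # Negative log-probability of picking the positive among all candidates.
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy example: the "factual" sample is close to the anchor,
# the "counterfactual" samples are not.
anchor = np.array([1.0, 0.0])
factual = np.array([1.0, 0.1])
counterfactuals = np.array([[-1.0, 0.0], [0.0, 1.0]])
loss = info_nce(anchor, factual, counterfactuals)
```

The loss is low when the anchor is more similar to the factual sample than to any counterfactual one, and it grows when a counterfactual sample is the closest match.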

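To address the problems of VQA-NLE models in logical reasoning, factual consistency, and insensitivity to semantic perturbations, we propose a natural language explanation model based on self-supervised multi-level contrastive learning (MCLE). It extracts discriminative features and aligns the feature space of explanations with the visual question and answer to generate more consistent explanations. We demonstrate the effectiveness of our method through extensive experiments, ablation analysis, and a case study.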

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

In introductory statistics courses, students are taught to remember the
well-known saying "Correlation is not causation". To date, statistical (i.e.,
correlation-based) learning has produced various successful frameworks, such
as the Transformer and large-scale pre-trained models, which stack multiple
parallel self-attention blocks to tackle a wide range of tasks. However, in the
causality community, how to build an integrated causal framework remains
largely unexplored despite the strong intervention capabilities of causal
methods. In this
paper, we propose the Causal Graph Routing (CGR) framework, an integrated
causal scheme relying entirely on the intervention mechanisms to reveal the
cause-effect forces hidden in data. Specifically, CGR is composed of a stack of
causal layers. Each layer includes a set of parallel deconfounding blocks from
different causal graphs. We combine these blocks via the concept of the
proposed sufficient cause, which allows the model to dynamically select the
suitable deconfounding methods in each layer. CGR is implemented as a stacked
network that integrates the no-confounder case, back-door adjustment,
front-door adjustment, and the probability of sufficient cause. We evaluate
this framework on two classical tasks in CV and NLP. Experiments show that CGR
surpasses the current state-of-the-art methods on both Visual Question
Answering and Long Document Classification tasks. In particular, CGR has great
potential for building a "causal" large-scale pre-trained model that
generalizes effectively to diverse tasks. It will improve machines'
comprehension of causal relationships within a broader semantic space.
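
The abstract names back-door adjustment as one of the deconfounding methods but does not show how CGR implements it. As a minimal illustration of the standard formula only, P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z), and not of the CGR network blocks themselves, the sketch below computes it for a binary confounder. The function name `backdoor_adjust` and all probabilities are made-up toy values.

```python
def backdoor_adjust(p_z, p_y_given_xz, x):
    """Return P(Y=1 | do(X=x)) via back-door adjustment over confounder Z.

    p_z: dict mapping each value z of the confounder to P(Z=z).
    p_y_given_xz: dict mapping (x, z) to P(Y=1 | X=x, Z=z).
    """
    # Average the conditional outcome over the marginal distribution of Z,
    # rather than over P(Z | X=x), which is what removes the confounding.
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())

# Toy numbers (hypothetical): a binary confounder Z with a uniform prior.
p_z = {0: 0.5, 1: 0.5}
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}

# Average causal effect of setting X=1 versus X=0.
effect = backdoor_adjust(p_z, p_y_given_xz, 1) - backdoor_adjust(p_z, p_y_given_xz, 0)
```

With these toy values the interventional probabilities are about 0.5 for X=1 and 0.3 for X=0, so the estimated effect is about 0.2.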

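This paper proposes the Causal Graph Routing (CGR) framework, which reveals the causal relationships hidden in data through intervention mechanisms. CGR surpasses the current state-of-the-art methods on tasks in computer vision and natural language processing, and it has the potential to support building a causal large-scale pre-trained model that improves machines' understanding of causal relationships within a broader semantic space.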