Question answering (QA) models are well-known to exploit data bias, e.g., the
language prior in visual QA and the position bias in reading comprehension.
Recent debiasing methods achieve good out-of-distribution (OOD)
generalizability with a considerable sacrifice of the in-distribution (ID)
performance. Therefore, they are only applicable in domains where the test
distribution is known in advance. In this paper, we present a novel debiasing
method called Introspective Distillation (IntroD) to make the best of both
worlds for QA. Our key technical contribution is to blend the inductive bias of
OOD and ID by introspecting whether a training sample fits in the factual ID
world or the counterfactual OOD one. Experiments on visual QA datasets VQA v2,
VQA-CP, and reading comprehension dataset SQuAD demonstrate that our proposed
IntroD maintains competitive OOD performance compared to other debiasing
methods, while sacrificing little or even achieving better ID performance
compared to the non-debiasing ones.
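
The blending idea described above can be illustrated with a minimal sketch. The abstract does not give the weighting rule, so everything here is an assumption made for illustration: the function names (`introspective_weights`, `blend_teachers`) are hypothetical, and the rule "weight each teacher by its share of the total cross-entropy" is one plausible choice, not necessarily the rule used by IntroD. Under this assumed rule, a sample that the ID teacher fits very easily is treated as likely bias-driven, so the OOD teacher receives more weight.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability a teacher assigns to the ground-truth label."""
    return -math.log(max(probs[label], 1e-12))

def introspective_weights(p_id, p_ood, label):
    """Assumed rule (not from the abstract): each teacher's weight is its
    share of the summed cross-entropy, so the teacher that fits the sample
    more easily is trusted less for that sample."""
    xe_id = cross_entropy(p_id, label)
    xe_ood = cross_entropy(p_ood, label)
    total = xe_id + xe_ood
    if total == 0.0:
        # Both teachers are perfectly confident and correct: split evenly.
        return 0.5, 0.5
    w_id = xe_id / total
    return w_id, 1.0 - w_id

def blend_teachers(p_id, p_ood, label):
    """Per-sample soft target: a convex combination of the two teachers."""
    w_id, w_ood = introspective_weights(p_id, p_ood, label)
    return [w_id * a + w_ood * b for a, b in zip(p_id, p_ood)]

# Hypothetical two-answer example: the ID teacher is very confident.
p_id = [0.9, 0.1]
p_ood = [0.5, 0.5]
w_id, w_ood = introspective_weights(p_id, p_ood, label=0)
target = blend_teachers(p_id, p_ood, label=0)
```

A student model would then be trained to match `target`, for example with a KL-divergence loss; that training loop is omitted here.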

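This paper proposes Introspective Distillation (IntroD), a novel debiasing method that blends the inductive biases of OOD and ID by introspecting whether a training sample fits the factual ID world or the counterfactual OOD one. It targets the data bias of QA models in visual QA and reading comprehension, and is validated experimentally on the VQA v2, VQA-CP, and SQuAD datasets.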

Introspective Distillation for Robust Question Answering

Visual QA is a pivotal challenge for higher-level reasoning, requiring
understanding language, vision, and relationships between many objects in a
scene. Although datasets like CLEVR are designed to be unsolvable without such
complex relational reasoning, some surprisingly simple feed-forward, "holistic"
models have recently shown strong performance on this dataset. These models
lack any kind of explicit iterative, symbolic reasoning procedure, which is
hypothesized to be necessary for counting objects, narrowing down the set of
relevant objects based on several attributes, etc. The reason for this strong
performance is poorly understood. Hence, our work analyzes such models, and
finds that minor architectural elements are crucial to performance. In
particular, we find that early fusion of language and vision provides
large performance improvements. This contrasts with the late fusion approaches
popular at the dawn of Visual QA. We propose a simple module we call Multimodal
Core, which we hypothesize performs the fundamental operations for multimodal
tasks. We believe that understanding why these elements are so important to
complex question answering will aid the design of better-performing algorithms
for Visual QA while minimizing hand-engineering effort.
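
The contrast between early and late fusion can be made concrete with a small sketch. The abstract does not say how the two modalities are combined, so the concatenation used here, the average pooling, and the function names `early_fusion` and `late_fusion` are all assumptions made for illustration. In the early variant the question embedding is attached to every spatial location before any further visual processing; in the late variant the image is first reduced to a single vector and only then joined with the question.

```python
def early_fusion(feature_map, question_vec):
    """Append the question embedding to the feature vector at every
    spatial location, so later layers see both modalities together."""
    return [[loc + question_vec for loc in row] for row in feature_map]

def late_fusion(feature_map, question_vec):
    """Average-pool the image into one vector first, then append the
    question embedding once at the end."""
    locations = [loc for row in feature_map for loc in row]
    channels = len(locations[0])
    pooled = [
        sum(loc[c] for loc in locations) / len(locations)
        for c in range(channels)
    ]
    return pooled + question_vec

# Hypothetical 2x2 feature map with 3 channels and a 2-dim question vector.
feature_map = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]],
    [[3.0, 1.0, 2.0], [0.0, 2.0, 2.0]],
]
question_vec = [0.5, -0.5]

early = early_fusion(feature_map, question_vec)
late = late_fusion(feature_map, question_vec)
```

The early output keeps the spatial grid (2 x 2 locations, 5 values each), so subsequent layers can relate question content to individual regions. The late output is a single 5-value vector in which spatial detail has already been averaged away.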

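This paper studies the simple feed-forward, "holistic" models that achieve strong performance on Visual QA and finds that certain architectural elements are crucial to that performance, with early fusion of language and vision being the most effective. It proposes a simple module called Multimodal Core, intended to provide the fundamental operations for multimodal tasks.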

The Visual QA Devil in the Details: The Impact of Early Fusion and Batch Norm on CLEVR

We propose a novel memory network model named Read-Write Memory Network
(RWMN) to perform question answering tasks for large-scale, multimodal
movie story understanding. The key focus of our RWMN model is to design the
read network and the write network that consist of multiple convolutional
layers, which enable memory read and write operations to have high capacity and
flexibility. While existing memory-augmented network models treat each memory
slot as an independent block, our use of multi-layered CNNs allows the model to
read and write sequential memory cells as chunks, which is more reasonable to
represent a sequential story because adjacent memory blocks often have strong
correlations. For evaluation, we apply our model to all six tasks of the
MovieQA benchmark, and achieve the best accuracies on several tasks, especially
on the visual QA task. Our model shows potential to better understand not
only the content in the story, but also more abstract information, such as
relationships between characters and the reasons for their actions.
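
The idea of reading and writing adjacent memory slots as chunks can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the RWMN architecture: `conv_over_slots` applies a single fixed kernel along the slot axis, with the same weights for every channel, whereas the model described above uses multiple learned convolutional layers. The dot-product attention in `attend` is likewise an assumed read-out, and both function names are hypothetical.

```python
import math

def conv_over_slots(memory, kernel, stride=1):
    """1-D convolution along the slot axis: each output slot is a weighted
    sum of len(kernel) adjacent input slots, i.e. one chunk of the story."""
    k = len(kernel)
    dim = len(memory[0])
    out = []
    for start in range(0, len(memory) - k + 1, stride):
        chunk = memory[start:start + k]
        out.append([
            sum(w * slot[d] for w, slot in zip(kernel, chunk))
            for d in range(dim)
        ])
    return out

def attend(memory, query):
    """Soft read: softmax over dot-product scores, then a weighted sum."""
    scores = [sum(m * q for m, q in zip(slot, query)) for slot in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    return [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(dim)
    ]

# Hypothetical story of four 2-dim slots, written as two chunks of two slots.
story = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
written = conv_over_slots(story, kernel=[0.5, 0.5], stride=2)
answer_vec = attend(written, query=[1.0, 0.0])
```

With a stride of 2, the four input slots are compressed into two chunk-level slots, so the read step attends over chunks rather than over independent slots.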

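We propose a novel memory network model, the Read-Write Memory Network (RWMN), for question answering in large-scale, multimodal movie story understanding. The RWMN centers on a read network and a write network, each built from multiple convolutional layers, which give memory read and write operations high capacity and flexibility. Reading and writing through multi-layer CNNs lets the model handle sequential memory cells as chunks, a more reasonable way to represent a sequential story. We apply the model to all six tasks of the MovieQA benchmark and achieve the best accuracies on several of them, especially the visual QA task. The model shows potential to better understand both the content of the story and the relationships between characters and the reasons behind their actions.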