Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for its predicted answer. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on generic large-scale vision-text datasets, and the learned representations are then transferred to the downstream VCR task. Despite their attractive performance, this paper posits that VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the important object-tag correlation. Based on these findings, we tentatively suggest some future directions from the aspects of datasets, evaluation metrics, and training tricks. We believe this work could prompt researchers to revisit the intuition and goals of VCR, and thus help tackle the remaining challenges in visual reasoning.

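This study aims to explain and answer questions about visual scenes by providing a plausible rationale as the reason for the predicted answer. Although Vision-Language Transformers perform well, they suffer from limited gains from pre-training, unexpected language bias, a constrained model architecture, and neglect of the important object-tag correlation. Accordingly, this study proposes directions for future research from the perspectives of datasets, evaluation metrics, and training tricks, which may prompt researchers to revisit the intuition and goals of VCR and help overcome the challenges in visual reasoning.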

Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR