Existing visual question answering methods tend to capture cross-modal
spurious correlations and fail to discover the true causal mechanism that
supports faithful reasoning based on the dominant visual evidence and the
question intention. Additionally, existing methods usually ignore
cross-modal event-level understanding, which requires jointly modeling event
temporality, causality, and dynamics. In this work, we focus on event-level
visual question answering from a new perspective, i.e., cross-modal causal
relational reasoning, by introducing causal intervention methods to discover
the true causal structures for visual and linguistic modalities. Specifically,
we propose a novel event-level visual question answering framework named
Cross-Modal Causal RelatIonal Reasoning (CMCIR), to achieve robust
causality-aware visual-linguistic question answering. To discover cross-modal
causal structures, the Causality-aware Visual-Linguistic Reasoning (CVLR)
module is proposed to collaboratively disentangle the visual and linguistic
spurious correlations via front-door and back-door causal interventions. To
model the fine-grained interactions between linguistic semantics and
spatial-temporal representations, we build a Spatial-Temporal Transformer (STT)
that captures the multi-modal co-occurrence interactions between visual and
linguistic content. To adaptively fuse the causality-aware visual and linguistic
features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that
leverages hierarchical linguistic semantic relations as guidance to
learn global semantic-aware visual-linguistic representations.
Extensive experiments on four event-level datasets demonstrate the superiority
of our CMCIR for discovering visual-linguistic causal structures and achieving
robust event-level visual question answering.
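
For reference, the two causal interventions named above have standard textbook forms. The following is a general sketch, not necessarily the exact formulation used in CMCIR: the symbols X (input), Y (answer), Z (an observed confounder), and M (a mediator between X and Y) are assumptions introduced here for illustration.

\[
\text{Back-door adjustment:}\quad
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)
\]

\[
\text{Front-door adjustment:}\quad
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
\]

The back-door form applies when the confounder Z can be observed and stratified; the front-door form applies when the confounder is unobserved but the effect of X on Y passes through an observable mediator M.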

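This paper proposes an event-level visual question answering framework named CMCIR to achieve robust causality-aware visual-linguistic question answering. It uses causal intervention methods to discover the true causal structures of the visual and linguistic modalities, and its superiority is validated on four event-level datasets.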