Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources than with existing datasets, yet yields comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.

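This paper introduces GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects, for a fine-grained analysis of end-to-end VQA models' ability to ground relative directions. Training on this dataset requires fewer computational resources than training on existing datasets, yet yields comparable or even higher performance. Through a thorough evaluation of two widely known end-to-end VQA architectures trained on GRiD-A-3D, the paper shows that the subtasks of recognizing and locating objects in a scene and estimating their intrinsic orientations are learned in the order in which relative directions are intuitively processed.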

Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning