3D visual grounding involves matching natural language descriptions with
their corresponding objects in 3D spaces. Existing methods often recognize
objects inaccurately and struggle to interpret complex linguistic queries,
particularly descriptions that involve multiple anchors or depend on the
viewpoint. In response, we present the MiKASA
(Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model
integrates a self-attention-based scene-aware object encoder and an original
multi-key-anchor technique, enhancing object recognition accuracy and the
understanding of spatial relationships. Furthermore, MiKASA improves the
explainability of decision-making, facilitating error diagnosis. Our model
achieves the highest overall accuracy in the Referit3D challenge on both the
Sr3D and Nr3D datasets, with an especially large margin in categories that
require viewpoint-dependent descriptions.
The source code and additional resources for this project are available on
GitHub.

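Summary: We propose the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer, which uses self-attention and a multi-key-anchor technique to improve object recognition accuracy and the understanding of spatial relationships, while also improving the explainability of its decisions. In the Referit3D challenge, our model achieves the highest accuracy on the Sr3D and Nr3D datasets and performs especially well on descriptions that depend on the viewpoint.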
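
The abstract does not specify the internals of the scene-aware object encoder, so the following is only a minimal sketch of the general idea it names: plain single-head self-attention over the objects in one scene, so that each object's embedding is informed by the others. The function name `scene_aware_encode`, the weight matrices, the feature sizes, and the residual connection are all assumptions made for illustration and are not taken from the MiKASA implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scene_aware_encode(obj_feats, w_q, w_k, w_v):
    """Hypothetical single-head self-attention over all objects in a scene.

    obj_feats: (num_objects, dim) array of per-object features.
    Returns the context-enriched features and the attention weights.
    """
    q = obj_feats @ w_q
    k = obj_feats @ w_k
    v = obj_feats @ w_v
    # Scaled dot-product scores between every pair of objects.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (an assumption): keep the original features
    # and add the context gathered from the other objects.
    return obj_feats + attn @ v, attn


# Toy scene: 5 objects with 8-dimensional features (illustrative sizes only).
rng = np.random.default_rng(0)
num_objects, dim = 5, 8
obj_feats = rng.normal(size=(num_objects, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

encoded, attn = scene_aware_encode(obj_feats, w_q, w_k, w_v)
```

Each row of `attn` sums to one and shows how strongly one object attends to every other object in the scene. Inspecting such weights is one common way to examine what an attention-based model relies on, although the abstract does not say this is how MiKASA provides its explainability.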