3D visual grounding aims to identify the target object in a 3D point cloud scene that a natural language description refers to. While previous works attempt to exploit the verbo-visual relation with cross-modal transformers, unstructured natural utterances and scattered objects can degrade their performance. In this paper, we introduce DOrA, a novel 3D visual grounding framework with Order-Aware referring. DOrA leverages Large Language Models (LLMs) to parse the language description and suggest a referential order of anchor objects. These ordered anchor objects allow DOrA to update visual features and locate the target object during the grounding process. Experimental results on the NR3D and ScanRefer datasets demonstrate that DOrA outperforms existing methods in both low-resource and full-data scenarios. In particular, DOrA surpasses current state-of-the-art frameworks by 9.3% and 7.8% in grounding accuracy under the 1% and 10% data settings, respectively.
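
To make the idea of a referential order concrete, the following is a toy sketch only, not DOrA's implementation. It assumes the LLM parsing step has already produced an ordered list of anchor labels ending with the target label, and it replaces the learned visual feature update with a simple nearest-neighbor rule on object centers. The names `SceneObject`, `ground_in_order`, and the example scene are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str      # unique identifier within the scene (hypothetical)
    label: str     # object class, e.g. "chair"
    center: tuple  # (x, y, z) center of the object


def ground_in_order(objects, referential_order):
    """Follow the anchor labels in order; the last label is the target.

    Assumption: each step keeps the candidate closest to the previously
    selected anchor. This is a geometric stand-in for a learned update.
    """
    current = None
    for label in referential_order:
        candidates = [o for o in objects if o.label == label]
        if not candidates:
            raise ValueError(f"no object labelled {label!r} in scene")
        if current is None:
            # First anchor: take the first match (toy choice).
            current = candidates[0]
        else:
            previous = current
            current = min(
                candidates,
                key=lambda o: math.dist(o.center, previous.center),
            )
    return current


# "The chair next to the table that is near the window."
# Hand-written stand-in for the order an LLM might suggest.
order = ["window", "table", "chair"]

scene = [
    SceneObject("window_a", "window", (0.0, 0.0, 1.0)),
    SceneObject("table_a", "table", (1.0, 0.0, 0.0)),
    SceneObject("table_b", "table", (5.0, 0.0, 0.0)),
    SceneObject("chair_a", "chair", (1.5, 0.0, 0.0)),
    SceneObject("chair_b", "chair", (5.5, 0.0, 0.0)),
    SceneObject("chair_c", "chair", (3.0, 0.0, 0.0)),
]

target = ground_in_order(scene, order)
print(target.name)
```

In this toy scene, resolving the anchors in order (window, then table, then chair) disambiguates among three chairs, which is the intuition behind ordering the anchors before locating the target.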

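DOrA is a 3D visual grounding framework that uses large language models. By introducing ordered anchor objects, it updates visual features and locates the target object, outperforming current state-of-the-art frameworks in both low-resource and full-data scenarios, with grounding accuracy gains of 9.3% and 7.8% under the 1% and 10% data settings, respectively.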

DOrA: 3D Visual Grounding with Order-Aware Referring