Localization plays a crucial role in enhancing the practicality and precision of visual question answering (VQA) systems. By enabling fine-grained identification of, and interaction with, specific parts of an object, it significantly improves a system's ability to provide contextually relevant and spatially accurate responses, which is essential for applications in dynamic environments such as robotics and augmented reality. However, traditional systems struggle to map objects within images accurately enough to generate nuanced, spatially aware responses. In this work, we introduce "Detect2Interact", which addresses these challenges with an advanced approach to fine-grained detection of objects' visual key fields. First, we use the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's common-sense knowledge to bridge the gap between an object's semantics and its spatial map. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation.
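
The three stages described above can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: the names `Segment`, `build_prompt`, `parse_segment_id`, and `locate_key_field` are hypothetical, and the segmenter, describer, and reasoner are passed in as plain callables standing in for SAM, Vision Studio, and GPT-4 respectively, so no real library API is assumed.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels; an assumed, simplified spatial map.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Segment:
    """One candidate region produced by the segmentation stage (hypothetical type)."""
    segment_id: int
    bbox: BBox


def build_prompt(question: str, descriptions: List[str], segments: List[Segment]) -> str:
    """Combine semantic descriptions and spatial candidates into one text prompt."""
    lines = ["Scene description: " + "; ".join(descriptions), "Candidate segments:"]
    for seg in segments:
        lines.append(f"  id={seg.segment_id} bbox={seg.bbox}")
    lines.append("Question: " + question)
    lines.append("Reply with the id of the single most relevant segment.")
    return "\n".join(lines)


def parse_segment_id(reply: str) -> Optional[int]:
    """Take the first integer in the reply as the chosen id; None if there is none."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None


def locate_key_field(
    image: Any,
    question: str,
    segmenter: Callable[[Any], List[Segment]],  # stand-in for SAM
    describer: Callable[[Any], List[str]],      # stand-in for Vision Studio
    reasoner: Callable[[str], str],             # stand-in for GPT-4
) -> Optional[Segment]:
    """Run segmentation, description, and reasoning, then return the chosen segment."""
    segments = segmenter(image)
    descriptions = describer(image)
    reply = reasoner(build_prompt(question, descriptions, segments))
    chosen = parse_segment_id(reply)
    by_id = {seg.segment_id: seg for seg in segments}
    # Unknown or missing ids yield None rather than a guessed region.
    return by_id.get(chosen)
```

Injecting the three stages as callables keeps the sketch testable with stubs; in practice each would wrap the corresponding model or service, and the prompt format and reply parsing would need to match whatever the language model actually returns.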

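This study proposes an advanced approach named "Detect2Interact", which uses fine-grained detection of objects' visual key fields to address the difficulty traditional systems have in accurately mapping objects within images to generate nuanced, spatially aware responses. It uses the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images, then uses Vision Studio to extract semantic object descriptions, and finally applies GPT-4's common-sense knowledge to bridge the gap between an object's semantics and its spatial map. The results show that Detect2Interact achieves consistent qualitative results across a large number of test cases and outperforms existing VQA systems with object detection by providing a more reasonable and finer visual representation.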

Detect2Interact: Localizing and Interacting with Object Key Fields in Visual Question Answering