Robots interacting with humans through natural language can unlock numerous applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot's workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies leverage powerful Vision-Language Models (VLMs) for visually grounding free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, featuring hierarchical application of Featurewise Linear Modulation (FiLM) to fuse image and text embeddings, enhancing visual grounding for complex attribute rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: Closed and Open Vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in closed vocabulary settings while being 100x smaller in size. Our model can effectively guide open-set object detectors like GroundedSAM to enhance open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33\% visual grounding accuracy in 15 tabletop scenes. We include our codebase in the supplementary material.

本研究旨在解决在复杂、杂乱环境中对同一对象的视觉定位与抓取姿态估计的不足。提出了HiFi-CS方法，通过分层地应用特征线性调制（FiLM）来融合图像和文本嵌入，显著提高了开放词汇设置中的视觉定位精度。实验结果表明，该模型在15个桌面场景中实现了90.33%的视觉定位准确率，展示了其在机器人抓取任务中的潜在影响。

HiFi-CS：面向机器人抓取的开放词汇视觉定位