Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

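This study focuses on the shortcomings of current vision-language models (VLMs) in the referring expression generation (REG) task, in particular their neglect of the principles of pragmatic communication. We introduce a new dataset (RefOI) and, through a systematic evaluation of state-of-the-art VLMs, reveal three key failures: failing to uniquely identify the referent, including superfluous information, and misaligning with human pragmatic preferences. The results underscore the need for pragmatically informed models and evaluation frameworks that better match real human communication.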

Insufficient Pragmatic Competence of Vision-Language Models in Referring Expression Generation