While human speakers use a variety of expressions when describing
the same object in an image, giving rise to a distribution of plausible labels
driven by pragmatic constraints, the extent to which current Vision \& Language
Large Language Models (VLLMs) can mimic this crucia