Visual Question Answering (VQA) with multiple choice questions enables a vision-centric evaluation of Multimodal Large Language Models (MLLMs). Although it reliably checks the existence of specific visual abilities, it is easier for the model to select an answer from multiple choices (VQA evaluation) than to generate the answer itself. In this work, we offer a novel perspective: we evaluate how well an MLLM understands a specific visual concept by its ability to uniquely describe two extremely similar images that differ only in the targeted visual concept. Specifically, we assess the ability of MLLMs to capture specific points of visual differences using self-retrieval, i.e., by retrieving the target image using its generated caption against the other image in the pair serving as the distractor. We curate 247 highly similar image pairs as part of the D3 benchmark. For each image pair, the model is prompted to: (1) Detect a specific visual difference, and (2) Describe the target image uniquely such that it (3) Discriminates the target image from the distractor. Self-retrieval within D3 enables whitebox evaluation across six different visual patterns, revealing that current models struggle to independently discern fine-grained visual differences, with open-source models failing to outperform random guess.

本研究针对多模态大型语言模型（MLLM）在视觉理解方面的评估，提出了一种新方法，强调模型在独特描述极为相似图像时的能力。通过自我检索机制，使用D3基准测试，我们发现当前模型在细微视觉差异的辨别上表现欠佳，且开源模型的表现甚至未能超越随机猜测。

超越视觉问答：MLLM评估的新方法