Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot vision tasks. These large pre-trained architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and arbitrary visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the performance of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 279K image-question pairs, and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs, where the language output refers to visual concepts that lack any effective grounding in the image. Our results show that state-of-the-art IT-LVMLs are still limited at identifying fine-grained visual concepts, object hallucinations are common across tasks, and their results are strongly biased by small variations in the input query, even if the queries have the very same semantics. Our findings also suggest that these models have weak visual groundings but they can still make adequate guesses by global visual patterns or textual biases contained in the LLM component.

本文介绍了一个名为MERLIM的多模式评估基准，用于评估IT-LVLM在基本计算机视觉任务中的表现，发现先进的IT-LVLM仍然有限于识别精细的视觉概念，对象幻觉在各种任务中普遍存在，而且结果受输入查询的细微变化的强烈偏见影响，即使查询具有相同的语义。研究结果还表明，这些模型在视觉基础上较弱，但仍然可以通过全局视觉模式或LLM组件中的文本偏见进行恰当的猜测。

魔法后的MERLIM: 大型图像-语言模型的多模态评估基准