The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the cost of question creation. Our experiments show that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs achieving consistently similar or lower accuracy on these questions than on human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.

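This work addresses the inaccurate evaluation that results from existing visual question answering benchmarks relying on open-ended questions. By introducing the AutoConverter framework, the authors automatically convert open-ended questions into multiple-choice questions, enabling objective evaluation and reducing the cost of question creation. Experiments show that the multiple-choice questions generated by AutoConverter are challenging, with vision language models achieving similar or lower accuracy on them than on human-created questions. The work also establishes VMCBench, a new unified multiple-choice benchmark, advancing the standardization and reproducibility of vision language model evaluation.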

Automatic Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation