Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks, yet their capabilities in fine-grained visual understanding remain insufficiently evaluated. Existing benchmarks either contain limited fine-grained evaluation samples that are mixed with other data, or are confined to object-level assessments in natural images. To holistically assess LVLMs' fine-grained visual understanding capabilities, we propose using document images with multi-granularity and multi-modal information to supplement natural images. In this light, we construct MMDocBench, a benchmark with various OCR-free document understanding tasks for the evaluation of fine-grained visual perception and reasoning abilities. MMDocBench defines 15 main tasks with 4,338 QA pairs and 11,353 supporting regions, covering various document images such as research papers, receipts, financial reports, Wikipedia tables, charts, and infographics. Based on MMDocBench, we conduct extensive experiments using 13 open-source and 3 proprietary advanced LVLMs, assessing their strengths and weaknesses across different tasks and document image types. The benchmark, task instructions, and evaluation code will be made publicly available.

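This study addresses the insufficient evaluation of Large Vision-Language Models (LVLMs) in fine-grained visual understanding by proposing MMDocBench, a new benchmark that comprehensively evaluates these models' document image understanding capabilities. By defining 15 main tasks that cover a range of document images, including research papers and financial reports, the study identifies the strengths and weaknesses of LVLMs across tasks and document types, providing an important basis for improving their performance.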

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding