Text-Centric Visual Question Answering (TEC-VQA), when properly formulated, not
only facilitates human-machine interaction in text-centric visual environments
but also serves as a de facto gold-standard proxy for evaluating AI models on
text-centric scene understanding. However, most TEC-VQA benchmarks have focused
on high-resource languages like English and Chinese. Although pioneering works
have expanded multilingual QA pairs in non-text-centric VQA datasets using
translation engines, this translation-based protocol suffers from a substantial
"visual-textual misalignment" problem when applied to TEC-VQA. Specifically,
it prioritizes the text in question-answer pairs while disregarding the visual
text present in images. Furthermore, it does not adequately tackle challenges
related to nuanced meaning, contextual distortion, language bias, and
question-type diversity. In this work, we address the task of multilingual
TEC-VQA and provide a benchmark with high-quality human expert annotations in 9
diverse languages, called MTVQA. To our knowledge, MTVQA is the first
multilingual TEC-VQA benchmark to provide human expert annotations for
text-centric scenarios. Further, our evaluation of several state-of-the-art
Multimodal Large Language Models (MLLMs), including GPT-4V, on MTVQA shows that
there is still room for performance improvement, underscoring the value of our
dataset. We hope this dataset will provide the community with fresh
perspectives and inspiration. The
MTVQA dataset will be available at
this https URL

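This work provides MTVQA, a benchmark dataset for multilingual TEC-VQA. Evaluating several advanced multimodal large language models on it shows that there is still room for performance improvement, underscoring the value of the dataset.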

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

Text-centric visual question answering (VQA) has made great strides with the
development of Multimodal Large Language Models (MLLMs), yet open-source models
still fall short of leading models like GPT4V and Gemini, partly due to a lack
of extensive, high-quality instruction tuning data. To this end, we introduce a
new approach for creating a massive, high-quality instruction-tuning dataset,
Square-10M, which is generated using closed-source MLLMs. The data construction
process, termed Square, consists of four steps: Self-Questioning, Answering,
Reasoning, and Evaluation. Our experiments with Square-10M led to three key
findings: 1) Our model, TextSquare, considerably surpasses previous open-source
state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%).
It even outperforms top-tier models like GPT4V and Gemini in 6 of 10
text-centric benchmarks. 2) Additionally, we demonstrate the critical role of
VQA reasoning data in offering comprehensive contextual insights for specific
questions. This not only improves accuracy but also significantly mitigates
hallucinations. Specifically, TextSquare scores an average of 75.1% across four
general VQA and hallucination evaluation datasets, outperforming previous
state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals
a clear pattern: model performance improves in direct proportion to exponential
increases in instruction-tuning data volume, validating both the necessity of
the dataset's scale and the high quality of Square-10M.
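
One way to read the third finding is as a log-linear relation: each tenfold increase in data volume yields a roughly constant additive gain in score. The sketch below assumes that interpretation and fits score = a + b * log10(n) by ordinary least squares. The function name `fit_log_linear` and the sample points are hypothetical and are not taken from the paper.

```python
import math


def fit_log_linear(sizes, scores):
    """Ordinary least-squares fit of score = a + b * log10(size)."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    # Slope is the covariance of (x, y) divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up points for illustration only (NOT results from the paper).
sizes = [10_000, 100_000, 1_000_000, 10_000_000]
scores = [40.0, 45.0, 50.0, 55.0]

a, b = fit_log_linear(sizes, scores)
# b is the estimated gain in score per tenfold increase in data volume.
print(f"intercept={a:.2f}, gain per 10x data={b:.2f}")
```

Under this reading, a slope `b` that stays stable as more data points are added would be consistent with the pattern the abstract describes, while a shrinking slope would suggest diminishing returns.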

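Using the Square-10M dataset, TextSquare far surpasses open-source models, proposes a new approach to instruction tuning for text-centric MLLMs, and sets a new standard on OCRBench (62.2%), while outperforming GPT4V and Gemini on 6 text-centric benchmarks. The study also demonstrates the key role of VQA reasoning data in providing comprehensive contextual insight, which improves accuracy and significantly mitigates hallucinations. Finally, it reveals the relationship between exponential growth in text-centric VQA dataset scale and improvements in model performance, validating both the necessity of the dataset's scale and the high quality of Square-10M.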