Text-Centric Visual Question Answering (TEC-VQA) not only facilitates
human-machine interaction in text-centric visual environments but also serves
as a de facto gold-standard proxy for evaluating AI models on text-centric
scene understanding. However, most TEC-VQA benchmarks have focused
on high-resource languages like English and Chinese. Although pioneering works
have expanded multilingual QA pairs in non-text-centric VQA datasets using
translation engines, the translation-based protocol encounters a substantial
"visual-textual misalignment" problem when applied to TEC-VQA. Specifically,
it prioritizes the text in question-answer pairs while disregarding the visual
text present in images. Furthermore, it does not adequately tackle challenges
related to nuanced meaning, contextual distortion, language bias, and
question-type diversity. In this work, we address the task of multilingual
TEC-VQA and provide MTVQA, a benchmark with high-quality human expert
annotations in 9 diverse languages. To our knowledge, MTVQA is the first
multilingual TEC-VQA benchmark to provide human expert annotations for
text-centric scenarios. Further, our evaluation of several state-of-the-art
Multimodal Large Language Models (MLLMs), including GPT-4V, on MTVQA shows
that there is still room for performance improvement, underscoring the value
of the dataset. We hope this dataset will offer the research community fresh
perspectives and inspiration. The
MTVQA dataset will be available at
this https URL

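This study provides MTVQA, a benchmark dataset for multilingual TEC-VQA. By evaluating several advanced multimodal large language models on it, the study finds that there is still room for performance improvement, which highlights the value of the dataset.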