We present LoCoVQA, a dynamic benchmark generator for evaluating long-context
extractive reasoning in vision language models (VLMs). LoCoVQA augments test
examples for mathematical reasoning, VQA, and character recognition tasks with
increasingly long visual contexts composed of both in-distribution and
out-of-distribution distractor images.
Across these tasks, a diverse set of VLMs rapidly loses performance as the
visual context length grows, often exhibiting a striking exponential decay
trend. This test assesses how well VLMs can ignore irrelevant information when
answering queries, a task that is quite easy for language models (LMs) in the
text domain, and demonstrates that current state-of-the-art VLMs lack this
essential capability for many long-context applications.
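
The augmentation described above can be illustrated with a minimal sketch. It
is not the LoCoVQA implementation: the function name `build_visual_context`,
its parameters, and the use of file-name strings in place of images are all
assumptions made for illustration. The sketch only shows the general idea of
hiding one task-relevant image among a chosen number of distractors at a random
position.

```python
import random


def build_visual_context(target, distractor_pool, context_length, seed=0):
    """Hypothetical helper: place one target image among distractors.

    Assumes `target` does not appear in `distractor_pool`.
    Returns the shuffled image list and the index of the target in it.
    """
    if context_length < 1:
        raise ValueError("context_length must be at least 1")
    if context_length - 1 > len(distractor_pool):
        raise ValueError("not enough distractors for this context length")

    rng = random.Random(seed)  # seeded so the example is reproducible
    # Draw distinct distractors, then shuffle the target in among them.
    distractors = rng.sample(distractor_pool, context_length - 1)
    images = distractors + [target]
    rng.shuffle(images)
    return images, images.index(target)


# Illustrative usage with placeholder file names (not real benchmark data).
pool = [f"distractor_{i}.png" for i in range(20)]
images, target_index = build_visual_context("target.png", pool, context_length=9)
```

Calling the helper with increasing values of `context_length` yields
progressively longer visual contexts for the same query, which is the axis
along which the performance decay is measured.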

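LoCoVQA is a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). The test assesses how well VLMs can ignore irrelevant information when answering questions, and shows that current state-of-the-art VLMs lack this key capability for many long-context applications.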