Large vision-language models (LVLMs) hallucinate: certain context cues in an
image may trigger the language module's overconfident and incorrect reasoning
about abnormal or hypothetical objects. Although a few benchmarks have been
developed to investigate LVLM hallucinations, they rely mainly on hand-crafted
corner cases whose failure patterns may not generalize, and fine-tuning on them
could undermine their validity. This motivates us to develop the first
automatic benchmark generation approach, AUTOHALLUSION, that harnesses a few
principal strategies to create diverse hallucination examples. It probes the
language modules in LVLMs for context cues and uses them to synthesize images
by: (1) adding objects abnormal to the context cues; (2) for two co-occurring
objects, keeping one and excluding the other; or (3) removing objects closely
tied to the context cues. It then generates image-based questions whose
ground-truth answers contradict the language module's prior. A model has to
overcome contextual biases and distractions to reach correct answers, while
incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION
enables us to create new benchmarks at minimal cost and thus overcomes the
fragility of hand-crafted benchmarks. It also reveals common failure patterns
and reasons, providing key insights to detect, avoid, or control
hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g.,
GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show
hallucination-induction success rates of 97.7% and 98.7% on the synthetic and
real-world datasets of AUTOHALLUSION, respectively, paving the way for a long
battle against hallucinations.
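
The three manipulation strategies and the prior-contradicting questions
described above can be illustrated with a minimal Python sketch. It is not the
authors' implementation: a scene is reduced to a set of object labels instead
of a synthesized image, and the names `Probe`, `insert_abnormal`,
`keep_one_of_pair`, `remove_expected`, and `is_hallucination` are hypothetical.
Only incorrect answers are checked here; checks for inconsistent answers
across questions are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """A yes/no question whose true answer contradicts the language prior."""
    question: str
    ground_truth: str  # answer implied by the edited scene
    prior_answer: str  # answer the context alone would suggest


def _existence_question(obj):
    return f"Does the image contain the object '{obj}'?"


def insert_abnormal(scene, abnormal_obj):
    """Strategy 1: add an object that is abnormal for the context."""
    edited = set(scene) | {abnormal_obj}
    return edited, Probe(_existence_question(abnormal_obj), "yes", "no")


def keep_one_of_pair(scene, kept, dropped):
    """Strategy 2: of two co-occurring objects, keep one, exclude the other."""
    edited = (set(scene) | {kept}) - {dropped}
    return edited, Probe(_existence_question(dropped), "no", "yes")


def remove_expected(scene, expected_obj):
    """Strategy 3: remove an object closely tied to the context."""
    edited = set(scene) - {expected_obj}
    return edited, Probe(_existence_question(expected_obj), "no", "yes")


def is_hallucination(probe, model_answer):
    """Flag an answer that disagrees with the ground truth of the edited scene."""
    return model_answer.strip().lower() != probe.ground_truth
```

For example, calling `keep_one_of_pair({"desk", "monitor", "keyboard"},
"monitor", "keyboard")` yields a scene without the keyboard and a probe whose
ground truth is "no"; a model that answers "yes" because monitors usually
appear with keyboards would be flagged by `is_hallucination`.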

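Large vision-language models suffer from hallucination. This work develops AUTOHALLUSION, an automatic approach for generating hallucination benchmarks, which identifies context cues and uses them to generate images and questions, revealing common failure patterns and causes of hallucination. A comprehensive evaluation of top-tier vision-language models shows hallucination-induction success rates of 97.7% and 98.7% on AUTOHALLUSION's synthetic and real-world datasets, respectively, offering new directions for addressing hallucination.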

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

We describe WebSuite, the first diagnostic benchmark for generalist web
agents, designed to systematically evaluate why agents fail. Advances in AI
have led to the rise of numerous web agents that autonomously operate a browser
to complete tasks. However, most existing benchmarks focus strictly on
measuring whether an agent can or cannot complete a task, without giving
insight into why. In this paper, we 1) develop a taxonomy of web actions to
facilitate identifying common failure patterns, and 2) create an extensible
benchmark suite to assess agents' performance on our taxonomized actions. This
benchmark suite consists of both individual tasks, such as clicking a button,
and end-to-end tasks, such as adding an item to a cart, and is designed such
that any failure of a task can be attributed directly to a failure of a
specific web action. We evaluate two popular generalist web agents, one
text-based and one multimodal, and identify unique weaknesses for each agent.
Because WebSuite can disaggregate task failures into specific action failures,
it enables granular identification of which UX flows an individual agent has
trouble with and immediately highlights promising avenues for improvement.
These findings highlight the need for more focused benchmarking of where web
agents go wrong in order to improve agents beyond their currently weak
performance.
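
The idea of attributing a task failure to a specific web action can be
sketched as follows. This is an illustrative assumption, not WebSuite's actual
code or taxonomy: each task run is modeled as an ordered list of
`(action_category, succeeded)` steps, the first failed step is blamed for the
task failure, and the category names (`"click"`, `"type"`) and function names
are hypothetical.

```python
from collections import Counter


def first_failed_action(steps):
    """Return the category of the first failed step, or None if all succeeded.

    steps: list of (action_category, succeeded) tuples in execution order.
    """
    for action, succeeded in steps:
        if not succeeded:
            return action
    return None


def failure_counts(runs):
    """Count, per action category, how many task runs failed on that action.

    runs: dict mapping a task name to its list of steps.
    """
    counts = Counter()
    for steps in runs.values():
        action = first_failed_action(steps)
        if action is not None:
            counts[action] += 1
    return counts
```

Given runs for an end-to-end task such as adding an item to a cart and an
individual task such as clicking a button, `failure_counts` returns one tally
per action category, which is the kind of per-action breakdown the abstract
describes.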

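WebSuite is the first diagnostic benchmark for generalist web agents that evaluates why agents fail. By decomposing task failures into specific action failures, it gives a detailed assessment of where web agent performance can be improved and shows the need for benchmarks that focus more on where agents fail.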