We introduce Blink, a new benchmark for multimodal large language models (LLMs)
that focuses on core visual perception abilities not found in other
evaluations. Most of the Blink tasks can be solved by humans "within a blink"
(e.g., relative depth estimation, visual correspondence, forensics detection,
and multi-view reasoning). However, we find these perception-demanding tasks
pose significant challenges for current multimodal LLMs because they resist
mediation through natural language. Blink reformats 14 classic computer vision
tasks into 3,807 multiple-choice questions, paired with single or multiple
images and visual prompting. While humans achieve 95.70% accuracy on average,
Blink is surprisingly challenging for existing multimodal LLMs: even the
best-performing GPT-4V and Gemini achieve accuracies of only 51.26% and 45.72%,
just 13.17 and 7.63 percentage points above random guessing, indicating that
such perception abilities have not "emerged" yet in recent multimodal LLMs. Our
analysis also
highlights that specialist CV models could solve these problems much better,
suggesting potential pathways for future improvements. We believe Blink will
stimulate the community to help multimodal LLMs catch up with human-level
visual perception.
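
As a quick sanity check on the figures above, the following minimal Python sketch recovers the random-guess baseline implied by the reported accuracies and margins, along with the remaining gap to human accuracy. The dictionary layout and function name are illustrative assumptions and are not part of the benchmark; only the numbers come from the text.

```python
# Figures quoted in the abstract: accuracy in %, margin in percentage points.
# The data layout and names below are hypothetical, chosen for illustration.
REPORTED = {
    "GPT-4V": {"accuracy": 51.26, "margin_over_random": 13.17},
    "Gemini": {"accuracy": 45.72, "margin_over_random": 7.63},
}
HUMAN_ACCURACY = 95.70


def implied_random_baseline(accuracy: float, margin: float) -> float:
    """Random-guess accuracy implied by a model's accuracy and its margin."""
    return round(accuracy - margin, 2)


# Implied random-guess accuracy, derived separately from each model's figures.
baselines = {
    name: implied_random_baseline(r["accuracy"], r["margin_over_random"])
    for name, r in REPORTED.items()
}

# Remaining gap to human accuracy, in percentage points.
human_gaps = {
    name: round(HUMAN_ACCURACY - r["accuracy"], 2)
    for name, r in REPORTED.items()
}

for name in REPORTED:
    print(f"{name}: implied random baseline {baselines[name]}%, "
          f"gap to humans {human_gaps[name]} points")
```

Both models imply the same random-guess baseline of 38.09%, so the two reported margins are mutually consistent and are absolute differences in percentage points rather than relative improvements.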
