BriefGPT.xyz
Jul, 2024
Vision language models are blind
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen
TL;DR
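A study of large language models with embedded vision capabilities shows that current state-of-the-art models perform very poorly on some simple visual tasks. Their vision is comparable to that of a nearsighted person who sees fine details as blurry, or even to that of a blind person making educated guesses.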
Abstract
Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and ...