BriefGPT.xyz
Mar, 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin...
TL;DR
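By learning adaptive attention patterns and pruning visual tokens, FastV significantly reduces computational cost while maintaining strong performance across a wide range of image and video understanding tasks, which helps deploy large vision-language models on edge devices and in commercial models.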
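To make the token-pruning idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each visual token already has an importance score (for example, the average attention it receives at some layer) and keeps the top-scoring half, echoing the "1/2 tokens" in the title. The function name `prune_visual_tokens` and the `keep_ratio` parameter are hypothetical and chosen only for illustration.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of visual tokens, preserving order.

    `tokens` and `scores` are parallel lists; `scores` is an assumed
    per-token importance value (e.g. average received attention).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    # Always keep at least one token so later layers still see the image.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank indices by score, highest first; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original sequence order among the kept tokens.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]
```

For instance, calling `prune_visual_tokens(["a", "b", "c", "d"], [0.1, 0.4, 0.3, 0.2])` keeps the two highest-scoring tokens in their original order. In a real model the pruning would be applied to hidden states inside the decoder rather than to a Python list.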
Abstract
In this study, we identify the inefficient attention phenomena in large vision-language models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the