BriefGPT.xyz
Jun, 2024
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan...
TL;DR
VoCo-LLaMA, built on Vision-Language Models, introduces Vision Compression tokens and leverages attention distillation to compress visual inputs and improve inference efficiency; it can also capture temporal correlations, giving it broad potential in multi-modal applications.
Abstract
Vision-language models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos.
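To make the bottleneck concrete: a VLM typically turns one image into hundreds of patch-embedding tokens, and compression replaces them with a few (or one) compact tokens. The sketch below is not VoCo-LLaMA's actual method (the excerpt gives no implementation details); it is a minimal, hypothetical illustration of attention-pooling many vision tokens into a single compressed token, with invented names (`compress_vision_tokens`) and dimensions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def compress_vision_tokens(vision_tokens, query):
    """Attention-pool N vision token embeddings into one compressed token.

    vision_tokens: (N, d) patch embeddings; query: (d,) a learned
    compression-token embedding (hypothetical stand-in for a VoCo token).
    Returns the (d,) compressed representation and the (N,) attention weights.
    """
    scores = vision_tokens @ query / np.sqrt(query.shape[0])  # scaled dot-product
    weights = softmax(scores)
    return weights @ vision_tokens, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))  # e.g. a 24x24 grid of patch embeddings
q = rng.normal(size=64)
compressed, w = compress_vision_tokens(tokens, q)
print(compressed.shape, round(float(w.sum()), 6))  # → (64,) 1.0
```

The LLM would then attend to the single compressed token instead of all 576 patch tokens, shrinking the vision portion of the context window by several hundredfold at the cost of some visual detail.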