Sep, 2024
Visual Context Window Extension: A New Perspective for Long Video Understanding
Hongchen Wei, Zhenzhong Chen
TL;DR
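This work addresses the shortcomings of existing Large Multimodal Models (LMMs) in long video understanding. It proposes extending the visual context window so that LMMs can be applied to long videos without retraining on long-video datasets. The results show that the method significantly improves performance on multiple long video understanding benchmarks and, in particular, reduces memory usage by about 45% without degrading performance.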
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (L