long video question answering is a challenging task that involves recognizing
short-term activities and reasoning about their fine-grained relationships.
State-of-the-art video large language models (vLLMs) hold
使用长视频理解任务中的 Large Language Models(LLMs)面临的挑战,本文提出了一种名为 LongVLM 的 VideoLLM 模型,通过分解长视频为短期片段,并使用分层令牌合并模块编码局部特征,维护顺序,整合全局语义信息,实现对长期视频的全面理解。实验证明了该模型在视频理解任务中的优越性能。