Advances in large language models (LLMs) have driven progress on video understanding tasks by combining LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent work attempts to understand long-term videos by extracting and compressing visual features into a fixed-size memory. Nevertheless, those methods rely only on the visual modality to merge video tokens and overlook the correlation between visual and textual queries, which makes it difficult to handle complex question-answering tasks effectively. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
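
The general idea of reducing a fixed-size memory under guidance from the text query can be illustrated with a toy sketch. This is not the AdaCM$^2$ algorithm, whose details the abstract does not give; it only assumes that each visual token is scored by the softmax attention it receives from the text tokens and that the lowest-scoring tokens are dropped once a budget is exceeded. The function name `reduce_memory`, the scoring rule, and all sizes are hypothetical choices made for illustration.

```python
import numpy as np

def reduce_memory(memory, frame_tokens, text_query, budget):
    """Hypothetical helper: append new frame tokens, then keep at most
    `budget` visual tokens, ranked by relevance to the text query."""
    # Stack the existing memory and the new visual tokens: shape (n, d).
    candidates = np.concatenate([memory, frame_tokens], axis=0)
    if candidates.shape[0] <= budget:
        return candidates
    d = candidates.shape[1]
    # Scaled dot-product scores, one row per text token: shape (t, n).
    scores = text_query @ candidates.T / np.sqrt(d)
    # Numerically stable softmax over the visual tokens.
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    # Assumed relevance: mean attention a visual token receives from the text.
    relevance = attn.mean(axis=0)
    # Keep the top-`budget` tokens and preserve their temporal order.
    keep = np.sort(np.argsort(relevance)[-budget:])
    return candidates[keep]

rng = np.random.default_rng(0)
dim, budget = 16, 32
text_query = rng.normal(size=(4, dim))      # 4 text tokens (random stand-ins)
memory = np.empty((0, dim))                 # memory starts empty
for _ in range(10):                         # stream of 10 frames, 8 tokens each
    frame_tokens = rng.normal(size=(8, dim))
    memory = reduce_memory(memory, frame_tokens, text_query, budget)
print(memory.shape)  # (32, 16)
```

In this sketch the memory never grows beyond `budget` tokens no matter how many frames are streamed, which is the property that lets a fixed memory serve arbitrarily long videos; a purely visual merging rule would differ only in how `relevance` is computed.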

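This work aims to address the limitations of existing video understanding models when processing long videos, particularly their efficiency on complex question-answering tasks. By introducing, for the first time, an adaptive cross-modality memory reduction approach, AdaCM$^2$ effectively improves video-text alignment while significantly reducing memory usage. Experimental results show that AdaCM$^2$ achieves state-of-the-art performance on multiple datasets; notably, it improves performance across tasks in the LVU dataset by 4.5% while reducing GPU memory consumption by up to 65%.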

AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction