Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated by tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
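The data flow described above (dual vision encoder, V-L adapter, Large Language Model, L-V adapter, spatio-temporal decoder) can be illustrated with a minimal shape-level sketch. This is not the authors' implementation: every function name, dimension, and the random weights are hypothetical stand-ins, and the assumption that the final LLM position acts as a segmentation token is ours, not something stated in the abstract.

```python
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
T, P = 4, 6                       # frames, patches per frame
D_VIS, D_LLM, D_DEC = 8, 16, 8    # vision, LLM, and decoder feature widths

def make_matrix(rows, cols):
    """Random values standing in for trained weights or extracted features."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(tokens, weight):
    """Linear map per token: (n, d_in) x (d_in, d_out) -> (n, d_out)."""
    d_out = len(weight[0])
    return [[sum(tok[i] * weight[i][j] for i in range(len(tok)))
             for j in range(d_out)] for tok in tokens]

def dual_encoder(num_frames, patches):
    """Stand-in for the dual vision encoder (assumed token layout)."""
    spatial = make_matrix(num_frames * patches, D_VIS)  # per-patch tokens
    temporal = make_matrix(num_frames, D_VIS)           # per-frame tokens
    return spatial, temporal

def llm_stub(vision_tokens, text_tokens):
    """Stand-in for the LLM: returns one hidden state per input token."""
    return [list(tok) for tok in vision_tokens + text_tokens]

def decode_masks(prompt, patch_feats, num_frames, patches):
    """Stand-in for the spatio-temporal decoder: one binary mask per frame."""
    scores = [sum(p * f for p, f in zip(prompt, feat)) for feat in patch_feats]
    return [[1 if scores[t * patches + k] > 0 else 0 for k in range(patches)]
            for t in range(num_frames)]

def ground(num_text_tokens=3):
    spatial, temporal = dual_encoder(T, P)
    w_vl = make_matrix(D_VIS, D_LLM)   # V-L adapter: vision -> LLM space
    w_lv = make_matrix(D_LLM, D_DEC)   # L-V adapter: LLM -> decoder space
    vision_tokens = project(spatial + temporal, w_vl)
    text_tokens = make_matrix(num_text_tokens, D_LLM)
    hidden = llm_stub(vision_tokens, text_tokens)
    seg_hidden = hidden[-1]            # assumed segmentation-token position
    prompt = project([seg_hidden], w_lv)[0]
    return decode_masks(prompt, spatial, T, P)

masks = ground()
print(len(masks), "frames,", len(masks[0]), "mask cells per frame")
```

The sketch only tracks how tensors would move between components; the returned `masks` hold one binary mask per frame, mirroring the per-frame masks a video grounding model would produce.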

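This work addresses fine-grained alignment between videos and text, an area where existing video-based multimodal models fall short on pixel-level grounding. The proposed VideoGLaMM combines a Large Language Model, a dual vision encoder, and a spatio-temporal decoder to achieve effective vision-language alignment and accurate mask generation. Experimental results show that VideoGLaMM outperforms existing methods on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation.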

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos