Large Vision-Language Models (LVLMs) have improved performance on
various downstream tasks in visual-language understanding. Most existing
approaches encode images and videos into separate feature spaces, which are
then fed as inputs to large language models. However, because images and
videos lack a unified tokenization, that is, they are misaligned before
projection, it is challenging for a Large Language Model (LLM) to learn
multi-modal interactions from several poor projection layers. In this work, we
unify visual representation into the language feature space to advance the
foundational LLM towards a unified LVLM. As a result, we establish a simple but
robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images
and videos in which the two modalities enhance each other. Video-LLaVA achieves
superior performance on 9 image benchmarks spanning 5 image
question-answering datasets and 4 image benchmark toolkits. Video-LLaVA also
outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on
MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive
experiments demonstrate that, within a unified visual representation, images
and videos benefit each other in Video-LLaVA, which outperforms models designed
specifically for images or videos.

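This paper proposes a unified Large Vision-Language Model (LVLM) that learns multi-modal interactions by unifying visual representations in the language feature space, achieving superior performance on image and video benchmarks.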