Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaption (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

本研究解决了现有视频编码器未充分利用多模态大型语言模型（MLLMs）所带来的外部先验的问题。我们提出了一种横跨模态的视频编码统一范式，通过将视频拆分为空间内容和运动成分来实现更紧凑的表示。实验结果表明，TT2V模式能有效重建语义信息，而IT2V模式则展现出良好的感知一致性，指明了视频编码未来的研究方向。

当视频编码遇到多模态大型语言模型：视频编码的统一范式