We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

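We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-context input and output. Trained with 24K interleaved image-text contexts, IXC-2.5 can seamlessly extend to 96K long contexts, allowing it to excel in tasks requiring extensive input and output contexts. For image-text comprehension, IXC-2.5 features three major upgrades: ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. For text-image composition, IXC-2.5 extends to two compelling applications using extra LoRA parameters: crafting webpages and composing high-quality text-image articles. Evaluated on 28 benchmarks, IXC-2.5 outperforms existing open-source state-of-the-art models on 16 of them, and surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.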

InternLM-XComposer-2.5: A Versatile Large Vision-Language Model Supporting Long-Contextual Input and Output