We introduce Xmodel-VLM, a multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work addresses a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we developed a 1B-scale language model from scratch and used the LLaVA paradigm for modal alignment. The result, Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing on numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model