We present MobileVLM, a competent multimodal vision language model (MMVLM)
targeted to run on mobile devices. It combines a range of mobile-oriented
architectural designs and techniques, comprising a set of language models at
the scale of 1.4B and 2.7B parameters trained from scratch, a multimodal
vision model pre-trained in the CLIP fashion, and cross-modality interaction
via an efficient projector. We evaluate MobileVLM on several typical VLM
benchmarks. Our models demonstrate performance on par with a few much larger
models. More importantly, we measure the inference speed on both a Qualcomm
Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain
state-of-the-art performance of 21.5 tokens and 65.3 tokens per second,
respectively. Our code will be made available at:
this https URL
