Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to downstream tasks and outperform specialized models in domains including autonomous driving, medical imaging, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs. Code is available at https://github.com/OpenGVLab/InternVL.

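This work addresses the high computational cost of training and deploying multimodal large language models (MLLMs) on consumer-grade GPUs or edge devices. The proposed Mini-InternVL series achieves 90% of the performance with only 5% of the parameters, and its unified adaptation framework enables it to outperform specialized models across a range of downstream tasks, significantly improving the practical applicability of MLLMs.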

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance