We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

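This work addresses the limited image understanding and reasoning capabilities of multimodal large language models. By adopting a data-centric approach and systematically studying how different data mixtures affect model training, the paper demonstrates the effectiveness of high-quality data combined with optimized training strategies. The results show that, even for small models (1B and 3B parameters), careful data curation can significantly improve performance, which supports the future development of multimodal large language models.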

MM1.5: Methods, Analysis, and Insights from Multimodal Large Language Model Fine-Tuning