The success of multi-modal large language models (MLLMs) is largely attributed to large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns, and collecting multi-modal data is expensive and labor-intensive, which makes the problem worse. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, that synthesizes high-quality multi-modal data from images alone. Unlike traditional methods, Oasis prompts the MLLM with only an image, which extends data diversity by a large margin. Our method also includes a careful quality-control step that ensures data quality. We collected over 500k samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. Because synthesis is image-based, the choice of images also lets us target domain-specific abilities of MLLMs. Code and data will be publicly available.
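
The image-only flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `synthesize` function, the `generate` and `is_acceptable` callables, and the `Sample` record are hypothetical names introduced here, and the toy stand-ins at the bottom replace a real MLLM and a real quality filter, neither of which the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    image_path: str
    instruction: str
    response: str


def synthesize(
    image_paths: Iterable[str],
    generate: Callable[[str, Optional[str]], str],
    is_acceptable: Callable[[str], bool],
) -> List[Sample]:
    """Build instruction-response pairs from images alone.

    Assumption: generate(image, None) returns a model-written instruction,
    and generate(image, instruction) returns the answer to that instruction.
    """
    samples = []
    for path in image_paths:
        # Step 1: prompt with the image only, no accompanying text.
        instruction = generate(path, None).strip()
        # Step 2: quality control; skip instructions the filter rejects.
        if not is_acceptable(instruction):
            continue
        # Step 3: answer the surviving instruction for the same image.
        response = generate(path, instruction).strip()
        samples.append(Sample(path, instruction, response))
    return samples


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
def fake_generate(image_path: str, text: Optional[str]) -> str:
    if text is None:
        return "" if "blank" in image_path else f"Describe the scene in {image_path}."
    return f"Answer to: {text}"


def non_empty_question(instruction: str) -> bool:
    # Placeholder filter: require at least three words.
    return len(instruction.split()) >= 3
```

Calling `synthesize(["cat.jpg", "blank.jpg"], fake_generate, non_empty_question)` keeps only the sample for `cat.jpg`, since the stand-in model produces an empty instruction for the other image and the filter drops it. In practice the two callables would wrap an actual MLLM and whatever quality criteria are used.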

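This work addresses the scarcity and high acquisition cost of training data for multi-modal large language models (MLLMs). It proposes a new method, Oasis, which synthesizes high-quality multi-modal data from images alone, substantially increasing data diversity while keeping quality under control. Experimental results show that the method significantly improves MLLM performance and makes it possible to target domain-specific abilities.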

Oasis: One Image Is All You Need for Multimodal Instruction Data Synthesis