The field of vision-language models (VLMs), which take images and text as input and produce text as output, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

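This study addresses key development questions in the field of vision-language models (VLMs). It provides a comprehensive overview of current mainstream approaches, analyzes their respective strengths and weaknesses, and suggests several underexplored research directions. By building the efficient VLM Idefics3-8B, it significantly improves document understanding capabilities, and it creates Docmatix, a dataset 240 times larger than previous ones, expanding the possibilities for related research.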

Building and Better Understanding Vision-Language Models: Insights and Future Directions