Large Language Models (LLMs) have introduced a new era of proficiency in comprehending complex healthcare and biomedical topics. However, few models support languages other than English or can interpret multi-modal input, both of which are crucial for global healthcare accessibility. In response, this study introduces Qilin-Med-VL, the first Chinese large vision-language model designed to integrate the analysis of textual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer (ViT) with a foundational LLM. It undergoes a two-stage curriculum training process consisting of feature alignment followed by instruction tuning. This method enhances the model's ability to generate medical captions and answer complex medical queries. We also release ChiMed-VL, a dataset of more than 1M image-text pairs. The dataset has been carefully curated to enable detailed and comprehensive interpretation of medical data across various types of images.

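Summary: This study introduces Qilin-Med-VL, the first Chinese large vision-language model, designed to integrate the analysis of textual and visual data. A two-stage curriculum training process built on a pre-trained Vision Transformer and a foundation language model strengthens its ability to generate medical captions and answer complex medical queries. The study also releases ChiMed-VL, a dataset of more than 1M image-text pairs that can be used for detailed and comprehensive interpretation of medical data.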

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare