Medical vision-language models enable co-learning and integration of features from medical imaging and clinical text. However, these models are difficult to train, and their latent representation space can be complex. Here we propose a novel method for pre-training and regularising medical vision-language models. The proposed method, named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), leverages a frozen language model for training stability and efficiency and introduces a novel orthogonality loss to harmonize the latent space geometry. We demonstrate the potential of the pre-trained model on three downstream tasks: medical image classification, segmentation, and object detection. Extensive experiments across five public datasets demonstrate that M-FLAG significantly outperforms existing medical vision-language pre-training approaches while reducing the number of parameters by 78\%. Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1\% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100\% of the data.
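
The abstract does not state the exact form of the orthogonality loss, so the following is only a minimal sketch of one common formulation, not necessarily the one used in M-FLAG. It assumes a batch of latent vectors arranged as a matrix (rows are samples, columns are latent dimensions), normalises each column to unit length, and penalises the squared Frobenius distance between the resulting Gram matrix and the identity. The function names `l2_normalize_columns` and `orthogonality_loss` are hypothetical and chosen for illustration.

```python
import math


def l2_normalize_columns(z):
    """Scale each column (latent dimension) of a batch matrix to unit L2 norm."""
    n_rows, n_cols = len(z), len(z[0])
    # A zero column keeps norm 1.0 so that the division below is safe.
    norms = [
        math.sqrt(sum(z[i][j] ** 2 for i in range(n_rows))) or 1.0
        for j in range(n_cols)
    ]
    return [[z[i][j] / norms[j] for j in range(n_cols)] for i in range(n_rows)]


def orthogonality_loss(z):
    """Assumed form: squared Frobenius norm of (Z^T Z - I) after column normalisation.

    The loss is zero when the latent dimensions are mutually orthogonal
    across the batch and grows as dimensions become correlated.
    """
    zn = l2_normalize_columns(z)
    n_rows, n_cols = len(zn), len(zn[0])
    loss = 0.0
    for a in range(n_cols):
        for b in range(n_cols):
            gram_ab = sum(zn[i][a] * zn[i][b] for i in range(n_rows))
            target = 1.0 if a == b else 0.0
            loss += (gram_ab - target) ** 2
    return loss


# Two orthogonal latent dimensions versus two identical (collinear) ones.
orthogonal_batch = [[1.0, 0.0], [0.0, 1.0]]
collinear_batch = [[1.0, 1.0], [1.0, 1.0]]
loss_orthogonal = orthogonality_loss(orthogonal_batch)
loss_collinear = orthogonality_loss(collinear_batch)
```

In this sketch the orthogonal batch incurs no penalty while the collinear batch does, which is the behaviour a regulariser of this kind is meant to have. In a real training setup such a term would typically be computed on the image encoder's latent output with an autodiff framework, while the language model's parameters stay fixed.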

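This study proposes a pre-training method named Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG). It uses a frozen language model to improve training stability and efficiency and introduces a new orthogonality loss to harmonize the latent space geometry. Extensive experiments on three downstream tasks (medical image classification, segmentation, and object detection) show that M-FLAG significantly outperforms existing medical vision-language pre-training methods and reduces the number of parameters by 78\%. Using only 1\% of the RSNA data, it achieves outstanding performance on the segmentation task, even surpassing ImageNet pre-trained models fine-tuned on 100\% of the data.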

M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization