Existing contrastive language-image pre-training learns a joint representation by matching abundant image-text pairs. However, medical datasets usually contain orders of magnitude fewer image-text pairs than natural-image datasets. In addition, medical image-text pairs often involve numerous complex, fine-grained correspondences. This paper aims to improve data efficiency by introducing multiple-to-multiple local relationship modeling, which provides denser supervision. More specifically, we propose a Medical Language-Image Pre-training (MLIP) framework that exploits limited medical image-text data more efficiently through patch-sentence matching. Furthermore, we introduce a masked contrastive learning strategy with semantic integrity estimation to reduce redundancy in images while preserving the underlying semantics. Our evaluation results show that MLIP outperforms previous work by a large margin in zero-shot and few-shot classification and in few-shot segmentation.

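This paper aims to improve data efficiency by introducing multiple-to-multiple local relationship modeling, making better use of limited medical image-text data. We propose the Medical Language-Image Pre-training (MLIP) framework, which exploits medical image-text data more efficiently through patch-sentence matching, and we introduce a masked contrastive learning strategy with semantic integrity estimation to reduce redundancy in images while preserving their underlying semantics. Our evaluation results show that MLIP holds a substantial advantage in zero-shot and few-shot classification and in few-shot segmentation tasks.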
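To make the idea of patch-sentence matching concrete, the following is a minimal sketch of how local patch-sentence similarities could feed a contrastive objective. It is not the MLIP implementation: the function names (`patch_sentence_score`, `symmetric_info_nce`), the max-then-mean aggregation, the temperature value, and the toy embeddings are all assumptions chosen for illustration, and masking and semantic integrity estimation are omitted entirely. The sketch uses only the Python standard library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def patch_sentence_score(patches, sentences):
    """Hypothetical local score between one image and one report.

    Each sentence embedding is matched to its most similar patch embedding
    (cosine similarity), and the best matches are averaged over sentences.
    This aggregation is an assumption, not the rule used by MLIP.
    """
    p = [l2_normalize(v) for v in patches]
    s = [l2_normalize(v) for v in sentences]
    best = [max(sum(a * b for a, b in zip(sv, pv)) for pv in p) for sv in s]
    return sum(best) / len(best)


def symmetric_info_nce(scores, temperature=0.1):
    """Symmetric InfoNCE loss over a square score matrix.

    scores[i][j] compares image i with report j; matched pairs lie on the
    diagonal. The loss averages the image-to-text and text-to-image terms.
    """
    n = len(scores)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [x / temperature for x in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(col) for col in zip(*scores)]
    return 0.5 * (cross_entropy(scores) + cross_entropy(columns))


# Toy batch: two images with two patch embeddings each, and two reports
# with one sentence embedding each (all values are made up).
images = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
reports = [
    [[1.0, 0.0]],
    [[0.0, 1.0]],
]

scores = [[patch_sentence_score(img, rep) for rep in reports] for img in images]
loss = symmetric_info_nce(scores)
```

In this toy batch the matched image-report pairs score higher than the mismatched ones, so the loss is close to zero. Because every sentence is compared against every patch, a single image-report pair contributes many local comparisons, which is one way to read the "denser supervision" that the abstract refers to.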

MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning