We present a comprehensive benchmark of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups: we train them on 2.8 million image-text pairs from four datasets and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and a benefit from learning fine-grained features. Finally, we make our code publicly available.

Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning