Cross-modal medical image-report retrieval plays a significant role in
clinical diagnosis and various medical generative tasks. Eliminating
heterogeneity between different modalities to enhance semantic consistency is
the key challenge of this task. Current Vision-Language Pretraining (VLP)
models, which use cross-modal contrastive learning and masked reconstruction as
joint training tasks, can effectively improve cross-modal retrieval
performance. This framework typically employs dual-stream inputs, using unmasked
data for cross-modal contrastive learning and masked data for reconstruction.
However, due to task competition and information interference caused by
significant differences between the inputs of the two proxy tasks, the
effectiveness of representation learning for intra-modal and cross-modal
features is limited. In this paper, we propose an efficient VLP framework named
Masked Contrastive and Reconstruction (MCR), which takes masked data as the
sole input for both tasks. This enhances task connections, reducing information
interference and competition between them, while also substantially decreasing
the required GPU memory and training time. Moreover, we introduce a new
modality alignment strategy named Mapping before Aggregation (MbA). Unlike
previous methods, MbA maps different modalities to a common feature space
before conducting local feature aggregation, thereby reducing the loss of
fine-grained semantic information necessary for improved modality alignment.
Qualitative and quantitative
experiments conducted on the MIMIC-CXR dataset validate the effectiveness of
our approach, demonstrating state-of-the-art performance in medical cross-modal
retrieval tasks.
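
The single-input idea behind MCR can be illustrated with a minimal NumPy
sketch. It is only a toy under stated assumptions, not the authors'
implementation: the linear "encoders" (W_img, W_txt), linear "decoders"
(D_img, D_txt), mean pooling, the 0.5 mask ratio, the temperature, and the
mean-squared-error text reconstruction are all hypothetical stand-ins. The
point it shows is that one masked view of each sample feeds both the
contrastive loss and the reconstruction loss, so no second unmasked forward
pass is needed.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Return the visible tokens and a boolean mask (True = masked position)."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings of shape (batch, dim)."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def cross_entropy_on_diagonal(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def mcr_step(images, reports, W_img, W_txt, D_img, D_txt, mask_ratio, rng):
    """One toy step: the same masked view drives both proxy tasks."""
    img_embs, txt_embs, recon_losses = [], [], []
    for patches, words in zip(images, reports):
        vis_p, mask_p = random_mask(patches, mask_ratio, rng)
        vis_w, mask_w = random_mask(words, mask_ratio, rng)
        # Encode only the visible tokens (hypothetical linear encoder + mean pool).
        z_img = (vis_p @ W_img).mean(axis=0)
        z_txt = (vis_w @ W_txt).mean(axis=0)
        img_embs.append(z_img)
        txt_embs.append(z_txt)
        # Reconstruct the masked tokens from the same embedding (toy decoder).
        recon_losses.append(np.mean((patches[mask_p] - z_img @ D_img) ** 2))
        recon_losses.append(np.mean((words[mask_w] - z_txt @ D_txt) ** 2))
    contrastive = info_nce(np.stack(img_embs), np.stack(txt_embs))
    reconstruction = float(np.mean(recon_losses))
    return contrastive, reconstruction

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 16, 8))    # 4 images, 16 patches, 8-dim features
reports = rng.normal(size=(4, 10, 8))   # 4 reports, 10 tokens, 8-dim features
W_img, W_txt = rng.normal(size=(8, 6)), rng.normal(size=(8, 6))
D_img, D_txt = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))

contrastive, reconstruction = mcr_step(
    images, reports, W_img, W_txt, D_img, D_txt, mask_ratio=0.5, rng=rng)
total_loss = contrastive + reconstruction
print(contrastive, reconstruction, total_loss)
```

Because the encoder in this sketch only ever sees the visible half of each
input, the memory and compute of the encoding step scale with the number of
kept tokens, which is the intuition behind the efficiency claim above.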

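The paper proposes an efficient VLP framework named Masked Contrastive and
Reconstruction (MCR), which takes masked data as the sole input for both tasks,
strengthening the connection between them and substantially reducing the
required GPU memory and training time. By mapping different modalities into a
common feature space before performing local feature aggregation, the method
reduces the loss of the fine-grained semantic information needed for modality
alignment. Qualitative and quantitative experiments on the MIMIC-CXR dataset
validate the effectiveness of the approach and demonstrate state-of-the-art
performance in medical cross-modal retrieval tasks.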
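
The ordering that MbA refers to can be sketched as follows. This is a
hypothetical illustration rather than the paper's architecture: the projection
matrices (W_img, W_txt), the per-token L2 normalization, and mean pooling as
the aggregation step are all assumptions. It contrasts mapping each local
feature into the shared space before pooling with pooling first and mapping
the single pooled vector afterwards.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def map_to_common(local_feats, W):
    """Project each local feature into the shared space and L2-normalize it."""
    return l2_normalize(local_feats @ W)

def aggregate(feats):
    """Mean-pool local features into one global vector (toy aggregation)."""
    return feats.mean(axis=0)

def mapping_before_aggregation(local_feats, W):
    # Every patch or word is mapped first, so token-level structure
    # is still available in the common space when pooling happens.
    return aggregate(map_to_common(local_feats, W))

def aggregation_before_mapping(local_feats, W):
    # Pool in the modality-specific space, then map the single pooled vector.
    return map_to_common(aggregate(local_feats)[None, :], W)[0]

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 32))   # 16 image patches, 32-dim
word_feats = rng.normal(size=(10, 24))    # 10 report tokens, 24-dim
W_img = rng.normal(size=(32, 8))          # hypothetical image projection
W_txt = rng.normal(size=(24, 8))          # hypothetical text projection

g_img = mapping_before_aggregation(patch_feats, W_img)
g_txt = mapping_before_aggregation(word_feats, W_txt)
g_img_alt = aggregation_before_mapping(patch_feats, W_img)

# With mapping first, patch-to-word similarities can be computed directly.
token_sim = map_to_common(patch_feats, W_img) @ map_to_common(word_feats, W_txt).T
print(g_img.shape, g_txt.shape, token_sim.shape)
print(np.allclose(g_img, g_img_alt))
```

In this sketch the two orderings give different global vectors because the
per-token normalization is non-linear; with a purely linear map and mean
pooling they would coincide, so any benefit depends on the mapping and
aggregation actually used.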