Medical visual question answering (VQA) is a challenging task that requires answering clinical questions of a given medical image, by taking consider of both visual and language information. However, due to the small scale of training data for medical VQA, pre-training fine-tuning paradigms have been a commonly used solution to improve model generalization performance. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text using medical image caption datasets, by leveraging both unimodal and multimodal contrastive losses, along with masked language modeling and image text matching as pretraining objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets with significant accuracy improvements of 2.2%, 14.7%, and 1.7% respectively. Besides, we conduct a comprehensive analysis to validate the effectiveness of different components of the approach and study different pre-training settings. Our codes and models are available at https://github.com/pengfeiliHEU/MUMC.

本文提出了一种新的自我监督方法来处理医学图像视觉问答问题，通过利用医学图像标题数据集来学习输入图像和文本的单模和多模特征表示，预训练模型后将其转移到下游的医学VQA任务中，已在三个公开的医学VQA数据集上取得了最先进的表现，具有显着的准确度提高。

利用单模态和多模态对比损失进行带有遮掩视觉和语言预训练，用于医学视觉问答