Medical visual question answering (VQA) is a challenging task that requires
answering clinical questions about a given medical image by taking both visual
and language information into account. Because training data for medical VQA is
scarce, the pre-training and fine-tuning paradigm has become a commonly used
way to improve model generalization. In this
paper, we present a novel self-supervised approach that learns unimodal and
multimodal feature representations of input images and text using medical image
caption datasets by leveraging both unimodal and multimodal contrastive
losses, along with masked language modeling and image-text matching, as
pre-training objectives. The pre-trained model is then transferred to downstream
medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA)
performance on three publicly available medical VQA datasets with significant
accuracy improvements of 2.2%, 14.7%, and 1.7%, respectively. We also
conduct a comprehensive analysis to validate the effectiveness of the different
components of the approach and to study different pre-training settings. Our code
and models are available at this https URL
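
The abstract names a multimodal contrastive loss but does not give its form. As a rough illustration only, the sketch below shows a generic symmetric image-text contrastive (InfoNCE-style) loss over a batch of paired embeddings. It is an assumption about the general technique, not the authors' implementation; the function names and the temperature value of 0.07 are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index,
    computed with the log-sum-exp trick for numerical stability."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    image_embs[k] and text_embs[k] are assumed to come from the same
    image-caption pair; every other pairing in the batch is a negative.
    The temperature default is a hypothetical choice.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Cosine similarity between every image and every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text: each image should pick out its own caption.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    t2i = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    imgs = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    # Correctly paired captions should give a lower loss than swapped ones.
    print(contrastive_loss(imgs, matched) < contrastive_loss(imgs, swapped))
```

In a setup like this, the loss is small when each image embedding is closest to its own caption embedding and large when the pairs are mismatched. A unimodal variant would apply the same computation to two views of the same modality.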

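This paper proposes a new self-supervised approach to medical visual question answering. It uses medical image caption datasets to learn unimodal and multimodal feature representations of input images and text. The pre-trained model is then transferred to downstream medical VQA tasks, where it achieves state-of-the-art performance on three publicly available medical VQA datasets with significant accuracy improvements.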