Medical visual question answering (VQA) is a challenging multimodal task, where Vision-Language Pre-training (VLP) models can effectively improve the generalization performance. However, most methods in the medical field treat VQA as an answer classification task which is difficult to transfer to practical application scenarios. Additionally, due to the privacy of medical images and the expensive annotation process, large-scale medical image-text pairs datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using large language models (LLMs), enabling those traditional medical vision field task data to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The code and model weights will be released upon the paper's acceptance.

本文提出了一种基于多任务自监督学习的大规模医学VQA任务框架（MISS），将医学VQA作为生成任务，并通过多任务学习对齐图像-文本特征；此外，我们通过使用大语言模型（LLMs），在单模态图像数据集上扩展单一模态图像特征空间，使得传统医学视觉领域任务数据能够应用于VLP，实验证明我们的方法在较少的多模态数据集上取得了优异结果并展示了生成式VQA模型的优势。

MISS：一个用于医学视觉问答的生成预训练和微调方法