Multimodal pre-training, which learns medical visual representations from
paired medical reports, has demonstrated its potential in the medical domain.
However, many pre-training tasks require extra annotations from clinicians, and
most of them fail to explicitly guide the model to learn the desired features
of different pathologies. To the best of our knowledge, we are the first to
utilize Visual Question Answering (VQA) for multimodal pre-training to guide
the framework to focus on targeted pathological features. In this work, we
leverage descriptions in medical reports to design multi-granular
question-answer pairs associated with different diseases, which assist the
framework in pre-training without requiring extra annotations from experts. We
also propose a novel pre-training framework with a quasi-textual feature
transformer, a module designed to transform visual features into a
quasi-textual space closer to the textual domain via a contrastive learning
strategy. This narrows the vision-language gap and facilitates modality
alignment. Our framework is applied to four downstream tasks: report
generation, classification, segmentation, and detection, across five datasets.
Extensive experiments demonstrate the superiority of our framework compared to
other state-of-the-art methods. Our code will be released upon acceptance.
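
The contrastive alignment between quasi-textual and textual features can be illustrated with a small sketch. The abstract does not specify the loss, so the symmetric InfoNCE-style objective below is an assumption, as are the function names, the temperature value of 0.07, and the toy two-dimensional features. It is meant only to show how such an objective rewards matched image-report pairs, not to reproduce the proposed framework.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets.

    Hypothetical helper: the temperature is an assumed value, not one taken
    from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, t_j)) / temperature for t_j in t]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(quasi_text, text, temperature=0.07):
    """Average of the quasi-text-to-text and text-to-quasi-text InfoNCE terms."""
    return 0.5 * (
        info_nce(quasi_text, text, temperature)
        + info_nce(text, quasi_text, temperature)
    )


# Toy batch: two report embeddings and two candidate quasi-textual embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row is close to its paired report
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # each row is close to the wrong report

loss_aligned = symmetric_contrastive_loss(aligned, text)
loss_shuffled = symmetric_contrastive_loss(shuffled, text)
print(loss_aligned < loss_shuffled)
```

In this toy batch the aligned features yield a much lower loss than the shuffled ones, which is the pressure that would pull transformed visual features toward the embeddings of their paired reports.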

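We use Visual Question Answering (VQA) in multimodal pre-training to guide the framework to focus on targeted pathological features. Using descriptions in medical reports, we design multi-granular question-answer pairs associated with different diseases, and we propose a novel pre-training framework based on a quasi-textual feature transformer that maps visual features into a quasi-textual space closer to the textual domain, narrowing the vision-language gap and facilitating modality alignment. Across four downstream tasks (report generation, classification, segmentation, and detection) on five datasets, extensive experiments demonstrate the superiority of our framework over other state-of-the-art methods. Our code will be released upon acceptance.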