Multimodal pre-training, which learns medical visual representations from
paired medical reports, has demonstrated its potential in the medical domain.
However, many pre-training tasks require extra annotations from clinicians, and
most of them fail to explicitly guide the model to learn the desired features
of different pathologies. To the best of our knowledge, we are the first to
utilize Visual Question Answering (VQA) for multimodal pre-training to guide
the framework to focus on targeted pathological features. In this work, we
leverage descriptions in medical reports to design multi-granular
question-answer pairs associated with different diseases, which assist the
framework in pre-training without requiring extra annotations from experts. We
also propose a novel pre-training framework with a quasi-textual feature
transformer, a module designed to transform visual features into a
quasi-textual space closer to the textual domain via a contrastive learning
strategy. This narrows the vision-language gap and facilitates modality
alignment. We apply our framework to four downstream tasks (report
generation, classification, segmentation, and detection) across five datasets.
Extensive experiments demonstrate the superiority of our framework compared to
other state-of-the-art methods. Our code will be released upon acceptance.
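
As a rough illustration of how report descriptions could be turned into multi-granular question-answer pairs without extra expert annotation, the sketch below derives a coarse question (is the disease present?) and a finer one (where is it?) from report sentences. The vocabularies, question templates, negation cues, and the function name `build_qa_pairs` are all assumptions made for this example; the abstract does not specify the actual rules.

```python
import re
from typing import List, Tuple

# Hypothetical vocabularies; the actual disease list and templates are not given in the abstract.
DISEASES = ["effusion", "pneumothorax", "cardiomegaly"]
LOCATIONS = ["left", "right", "bilateral"]
NEGATION_CUES = (" no ", " without ", " negative for ")


def build_qa_pairs(report: str) -> List[Tuple[str, str]]:
    """Derive coarse (presence) and fine (location) QA pairs from one report."""
    sentences = [s.strip().lower() for s in re.split(r"[.;]", report) if s.strip()]
    pairs: List[Tuple[str, str]] = []
    for disease in DISEASES:
        mentions = [s for s in sentences if disease in s]
        if not mentions:
            continue  # the report is silent on this disease, so no question is asked
        sentence = mentions[0]
        # Pad with a leading space so cues at the start of the sentence also match.
        negated = any(cue in f" {sentence}" for cue in NEGATION_CUES)
        # Coarse granularity: presence or absence of the disease.
        pairs.append((f"Is there {disease}?", "no" if negated else "yes"))
        if negated:
            continue
        # Finer granularity: location, only when the report states one.
        location = next(
            (loc for loc in LOCATIONS if re.search(rf"\b{loc}\b", sentence)), None
        )
        if location is not None:
            pairs.append((f"Where is the {disease}?", location))
    return pairs


if __name__ == "__main__":
    example = "Small left pleural effusion. No pneumothorax. Heart size is normal."
    for question, answer in build_qa_pairs(example):
        print(question, "->", answer)
```

Keyword matching of this kind is only a stand-in: it shows that the supervision comes from the report text itself, but a real pipeline would need more robust negation and synonym handling.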

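We use Visual Question Answering (VQA) in multimodal pre-training to guide the framework toward targeted pathological features. Using descriptions in medical reports, we design multi-granular question-answer pairs associated with different diseases, and we propose a new pre-training framework built on a quasi-textual feature transformer that maps visual features into a quasi-textual space close to the textual domain, narrowing the vision-language gap and facilitating modality alignment. Extensive experiments on four downstream tasks (report generation, classification, segmentation, and detection) across five datasets show that our framework outperforms other state-of-the-art methods. Our code will be released upon acceptance.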

Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

Learning medical visual representations through vision-language pre-training
has made remarkable progress. Despite its promising performance, it still
faces two challenges: local alignment lacks interpretability and clinical
relevance, and the internal and external representations of image-report
pairs are insufficiently learned. To address these issues, we propose an Anatomical
Structure-Guided (ASG) framework. Specifically, we parse raw reports into
triplets <anatomical region, finding, existence>, and fully utilize each
element as supervision to enhance representation learning. For the anatomical
region, we design an automatic anatomical region-sentence alignment paradigm in
collaboration with radiologists, treating regions and sentences as the minimum
semantic units for exploring fine-grained local alignment. For finding and existence, we regard
them as image tags, applying an image-tag recognition decoder to associate
image features with their respective tags within each sample and constructing
soft labels for contrastive learning to improve the semantic association of
different image-report pairs. We evaluate the proposed ASG framework on two
downstream tasks spanning five public benchmarks. Experimental results
demonstrate that our method outperforms state-of-the-art methods.
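
To make the triplet and soft-label idea more concrete, the sketch below represents parsed reports as <anatomical region, finding, existence> triplets, turns finding and existence into image tags, and builds soft contrastive targets from tag overlap between samples. The Jaccard overlap and row normalization are assumptions chosen for illustration, as are the names `Triplet`, `tags_from_triplets`, and `soft_labels`; the abstract does not state how the soft labels are actually computed.

```python
from typing import List, NamedTuple, Set


class Triplet(NamedTuple):
    """One parsed report statement: <anatomical region, finding, existence>."""
    region: str
    finding: str
    existence: bool


def tags_from_triplets(triplets: List[Triplet]) -> Set[str]:
    """Treat each (finding, existence) combination as an image tag."""
    return {
        f"{t.finding}:{'present' if t.existence else 'absent'}" for t in triplets
    }


def soft_labels(tag_sets: List[Set[str]]) -> List[List[float]]:
    """Row-normalized Jaccard overlap between the tag sets of all sample pairs.

    Assumption: samples sharing more tags should be softer negatives.
    Two samples with no tags at all are treated as fully similar.
    """
    similarities = []
    for a in tag_sets:
        row = []
        for b in tag_sets:
            union = a | b
            row.append(len(a & b) / len(union) if union else 1.0)
        similarities.append(row)
    # Each diagonal entry is 1.0, so every row sum is at least 1 and safe to divide by.
    return [[value / sum(row) for value in row] for row in similarities]


if __name__ == "__main__":
    batch = [
        [Triplet("left lung", "effusion", True)],
        [Triplet("left lung", "effusion", True), Triplet("right lung", "edema", True)],
        [Triplet("right lung", "pneumothorax", True)],
    ]
    for row in soft_labels([tags_from_triplets(sample) for sample in batch]):
        print([round(value, 3) for value in row])
```

Each row can then serve as the target distribution in a cross-entropy over image-report similarity scores, replacing the usual one-hot target so that pairs with overlapping tags are not pushed apart as hard negatives.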

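Learning medical visual representations through vision-language pre-training has made remarkable progress. This work proposes an Anatomical Structure-Guided (ASG) framework to address the lack of interpretability and clinical relevance in local alignment and the insufficient internal and external representation learning of image-report pairs. By automatically aligning anatomical regions with sentences and treating findings and existence as image tags, the method outperforms existing methods on five public benchmark datasets.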

Anatomical Structure-Guided Medical Vision-Language Pre-training

Learning medical visual representations directly from paired radiology
reports has become an emerging topic in representation learning. However,
existing medical image-text joint learning methods are limited by instance or
local supervision analysis, ignoring disease-level semantic correspondences. In
this paper, we present a novel Multi-Granularity Cross-modal Alignment (MGCA)
framework for generalized medical visual representation learning by harnessing
the naturally exhibited semantic correspondences between medical images and
radiology reports at three different levels, i.e., pathological region-level,
instance-level, and disease-level. Specifically, we first incorporate the
instance-wise alignment module by maximizing the agreement between image-report
pairs. Further, for token-wise alignment, we introduce a bidirectional
cross-attention strategy to explicitly learn the matching between fine-grained
visual tokens and text tokens, followed by contrastive learning to align them.
More importantly, to leverage high-level inter-subject semantic
correspondences (e.g., disease), we design a novel cross-modal disease-level
alignment paradigm to enforce the cross-modal cluster assignment consistency.
Extensive experimental results on seven downstream medical image datasets
covering image classification, object detection, and semantic segmentation
tasks demonstrate the stable and superior performance of our framework.
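
Instance-wise alignment by "maximizing the agreement between image-report pairs" is commonly implemented as a symmetric InfoNCE objective, so the sketch below shows that loss in plain Python. It is an assumption that MGCA uses exactly this form; the temperature value and the function name `symmetric_info_nce` are illustrative, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def _normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the maximum for a numerically stable log-sum-exp
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def symmetric_info_nce(
    image_embeddings: List[Sequence[float]],
    text_embeddings: List[Sequence[float]],
    temperature: float = 0.07,  # illustrative value, not taken from the abstract
) -> float:
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    Row i of each input is assumed to come from the same image-report pair.
    """
    images = [_normalize(v) for v in image_embeddings]
    texts = [_normalize(v) for v in text_embeddings]
    # Cosine similarity of every image with every report, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[2.0, 0.0], [0.0, 3.0]]
    mismatched = [[0.0, 3.0], [2.0, 0.0]]
    print("matched pairs:   ", symmetric_info_nce(images, matched))
    print("mismatched pairs:", symmetric_info_nce(images, mismatched))
```

The loss is small when each image is most similar to its own report and large when the pairing is shuffled, which is the behavior the instance-level module relies on. The token-wise and disease-level objectives described above would be added on top of a term like this.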

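This paper proposes a Multi-Granularity Cross-modal Alignment framework that learns generalized medical visual representations by exploiting the natural semantic correspondences between medical images and radiology reports at the pathological region level, instance level, and disease level. Experiments show stable and superior performance on seven downstream medical image datasets covering image classification, object detection, and semantic segmentation.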