Medical vision language pre-training (VLP) has emerged as a frontier of
research, enabling zero-shot pathological recognition by comparing the query
image with the textual descriptions for each disease. Due to the complex
semantics of biomedical texts, current methods struggle to align medical images
with key pathological findings in unstructured reports, which leads to
misalignment with the target disease's textual representation. In this paper,
we introduce a novel VLP framework designed to dissect disease descriptions
into their fundamental aspects, leveraging prior knowledge about the visual
manifestations of pathologies. This is achieved by consulting a large language
model and medical experts. Integrating a Transformer module, our approach
aligns an input image with the diverse elements of a disease, generating
aspect-centric image representations. By consolidating the matches from each
aspect, we improve the compatibility between an image and its associated
disease. Additionally, capitalizing on the aspect-oriented representations, we
present a dual-head Transformer tailored to process known and unknown diseases,
optimizing overall detection efficacy. In experiments on seven downstream
datasets, our method outperforms recent methods by up to 8.07% and 11.23% in
AUC for seen and novel categories, respectively. Our code is released at
https://github.com/HieuPhan33/MAVL.
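
As a rough illustration of the matching step described above, the following Python sketch scores each disease by averaging per-aspect cosine similarities between aspect-centric image embeddings and aspect text embeddings. It is a minimal sketch under stated assumptions, not the released MAVL implementation: the function names, the nested-dictionary layout, and the use of a plain mean to consolidate aspects are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disease_score(aspect_image_embs, aspect_text_embs):
    """Consolidate per-aspect matches for one disease.

    Both arguments map an aspect name to an embedding. Averaging the
    similarities is an assumption made for this sketch.
    """
    sims = [
        cosine(aspect_image_embs[name], aspect_text_embs[name])
        for name in aspect_text_embs
    ]
    return sum(sims) / len(sims)


def rank_diseases(image_embs, text_bank):
    """Return (disease, score) pairs sorted from best to worst match.

    image_embs[disease][aspect] is the aspect-centric image embedding,
    text_bank[disease][aspect] is the embedding of that aspect's description.
    """
    scores = {
        disease: disease_score(image_embs[disease], aspects)
        for disease, aspects in text_bank.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions, a zero-shot prediction for a novel disease only needs text embeddings for its aspect descriptions; no image labels for that disease are involved in the scoring.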

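By consulting a large language model and medical experts, we propose a novel VLP framework that decomposes disease descriptions into their fundamental aspects, leveraging prior knowledge of the visual manifestations of pathologies. By integrating a Transformer module, our method aligns an input image with the multiple aspects of a disease, generating aspect-centric image representations. By consolidating the matches from each aspect, we improve the compatibility between an image and its associated disease. In addition, we present an aspect-oriented dual-head Transformer that handles known and unknown diseases to optimize overall detection efficacy. Experiments on seven datasets show that our method outperforms recent methods by up to 8.07% and 11.23% in AUC for seen and novel categories, respectively.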

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Matching Framework

Expert annotation of 3D medical images for downstream analysis is
resource-intensive, posing challenges in clinical applications. Visual
self-supervised learning (vSSL), though effective for learning visual
invariance, neglects domain knowledge from medicine. To incorporate medical
knowledge into visual representation learning, vision-language pre-training
(VLP) has shown promising results on 2D images. However, existing VLP
approaches become impractical when applied to high-resolution 3D medical
images because of GPU hardware constraints and the potential loss of critical
details caused by downsampling, the intuitive workaround for those
constraints. To address these limitations,
we introduce T3D, the first VLP framework designed for high-resolution 3D
medical images. T3D incorporates two text-informed pretext tasks:
(i) text-informed contrastive learning and
(ii) text-informed image restoration. These tasks focus on
learning 3D visual representations from high-resolution 3D medical images and
integrating clinical knowledge from radiology reports, without distorting
information through forced alignment of downsampled volumes with detailed
anatomical text. Trained on a newly curated large-scale dataset of 3D medical
images and radiology reports, T3D significantly outperforms current vSSL
methods on tasks such as organ and tumor segmentation, as well as disease
classification. This underlines T3D's potential in representation learning for
3D medical image analysis. All data and code will be available upon acceptance.
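
To make the idea of a contrastive objective between images and text concrete, here is a minimal Python sketch of a generic symmetric InfoNCE loss over paired image and report embeddings. This is an assumption-laden illustration, not T3D's actual objective: the abstract does not specify the loss, and the function names, the temperature value, and the list-of-lists input format are hypothetical.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: image i and text i form the positive pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

In this sketch the loss is small when each image embedding is closest to its own report embedding and large when the pairs are mismatched, which is the behaviour a contrastive pretext task relies on.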

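T3D is the first VLP framework designed for high-resolution 3D medical images. Through two text-informed pretext tasks, text-informed contrastive learning and text-informed image restoration, it learns 3D visual representations from high-resolution 3D medical images while integrating clinical knowledge, and shows the potential to significantly outperform existing vSSL methods on tasks such as organ and tumor segmentation and disease classification.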

T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training

Vision-language pre-training (VLP) on large-scale datasets has shown strong
performance on various downstream tasks. In contrast to the many benchmarks
available for English corpora, large-scale pre-training datasets and
downstream datasets for Chinese corpora remain largely unexplored. In this
work, we build a large-scale high-quality Chinese cross-modal benchmark named
ZERO for the research community, which contains the currently largest public
pre-training dataset ZERO-Corpus and five human-annotated fine-tuning datasets
for downstream tasks. ZERO-Corpus contains 250 million images paired with 750
million text descriptions, and two of the five fine-tuning datasets are
currently the largest for Chinese cross-modal downstream tasks. Along with
the ZERO benchmark, we also develop a VLP framework with a pre-Ranking +
Ranking mechanism, boosted with target-guided Distillation and feature-guided
Distillation (R2D2), for large-scale cross-modal learning. A global contrastive
pre-ranking is first introduced to learn the individual representations of
images and texts. These primitive representations are then fused in a
fine-grained ranking manner via an image-text cross encoder and a text-image
cross encoder. Target-guided distillation and feature-guided distillation
are further proposed to enhance the capability of R2D2. With the ZERO-Corpus
and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve
downstream datasets from five broad categories of tasks: image-text
retrieval, image-text matching, image captioning, text-to-image generation, and
zero-shot image classification. The datasets, models, and code are available
at this https URL
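
The pre-ranking plus ranking idea described above can be sketched as a generic two-stage retrieval pipeline in Python. This is only an illustration under stated assumptions, not the R2D2 implementation: the function names are hypothetical, the first stage uses plain dot products over independently computed embeddings, and `pair_scorer` is a placeholder callable standing in for a cross encoder that scores a query and a candidate jointly.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pre_rank(query_emb, candidate_embs, k):
    """Stage 1: shortlist the k candidates with the highest dot product.

    Embeddings are computed independently per item, so this stage is cheap
    and can scan the whole candidate pool.
    """
    order = sorted(
        range(len(candidate_embs)),
        key=lambda i: dot(query_emb, candidate_embs[i]),
        reverse=True,
    )
    return order[:k]


def rerank(query, candidates, shortlist, pair_scorer):
    """Stage 2: reorder the shortlist with a joint (query, candidate) scorer."""
    return sorted(
        shortlist,
        key=lambda i: pair_scorer(query, candidates[i]),
        reverse=True,
    )


def two_stage_retrieve(query, query_emb, candidates, candidate_embs,
                       pair_scorer, k=2):
    """Run pre-ranking, then fine-grained ranking on the shortlist only."""
    shortlist = pre_rank(query_emb, candidate_embs, k)
    return rerank(query, candidates, shortlist, pair_scorer)
```

The design choice illustrated here is that the expensive joint scorer runs only on the k shortlisted candidates, while the inexpensive embedding comparison handles the full pool.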

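This work introduces ZERO, a large-scale, high-quality cross-modal benchmark for Chinese corpora, which contains the largest public pre-training dataset, ZERO-Corpus, and five human-annotated fine-tuning datasets for downstream tasks. It also proposes R2D2, a VLP framework based on a pre-ranking and ranking mechanism that uses target-guided distillation and feature-guided distillation for large-scale cross-modal learning, and achieves state-of-the-art performance on five categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification.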