Vision-language foundation models have recently shown outstanding
performance in various perception learning tasks. This performance relies
mainly on large-scale pre-training datasets and a range of data augmentation
techniques. However, the domain generalization problem of vision-language
foundation models remains to be addressed, as it limits their
generalizability to unknown data distributions. In this paper, we introduce a
simple but efficient Diffusion Sampling approach to Domain Generalization
(ED-SAM) to improve the generalizability of vision-language foundation
models. Our
theoretical analysis reveals the critical role of the diffusion model in
domain generalization for vision-language foundation models and the relation
between the two. Based on this analysis, we introduce a simple yet effective
Transport Transformation for the diffusion sampling method. It effectively
generates adversarial samples that improve the generalizability of the
foundation model to unknown data distributions. Experimental results on
vision-language pre-training datasets of different scales, including CC3M,
CC12M, and LAION400M, consistently show the state-of-the-art performance and
scalability of the proposed ED-SAM approach compared with other recent
methods.

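This paper introduces a new, simple yet efficient diffusion sampling approach (ED-SAM) that improves the generalizability of vision-language foundation models by generating adversarial samples, making the models more applicable to unknown data distributions. Experimental results show that, compared with other recent methods, the proposed ED-SAM approach consistently achieves state-of-the-art performance and scalability on vision-language pre-training datasets of different scales.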

ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models

Vision-language foundation models are increasingly investigated in the
fields of computer vision and natural language processing, yet their
exploration in ophthalmology and broader medical applications remains limited.
The main challenge is the lack of labeled data for training foundation models.
To address this issue, a CLIP-style retinal image foundation model is developed
in this paper. Our foundation model, RET-CLIP, is specifically trained on a
dataset of 193,865 patients to extract general features of color fundus
photographs (CFPs), employing a tripartite optimization strategy that focuses
on the left eye, right eye, and patient levels to reflect real-world clinical
scenarios. Extensive experiments demonstrate that RET-CLIP outperforms existing
benchmarks across eight diverse datasets spanning four critical diagnostic
categories: diabetic retinopathy, glaucoma, multiple disease diagnosis, and
multi-label classification of multiple diseases, demonstrating the
performance and generality of our foundation model. The source code and
pre-trained model are available at this https URL
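
One possible reading of the tripartite optimization strategy is a sum of three CLIP-style contrastive terms, one each for the left-eye, right-eye, and patient-level image embeddings, all matched against the embedding of the diagnostic report. The sketch below is only an illustration under that assumption and is not the RET-CLIP implementation: the function names, the single shared report embedding, the equal default weights, and the temperature of 0.07 are all hypothetical choices, and embeddings are assumed to be non-zero vectors given as plain Python lists.

```python
import math


def _unit(vector):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_partition = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_partition - row[k]
    return total / len(logits)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings."""
    images = [_unit(v) for v in image_embs]
    texts = [_unit(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


def tripartite_loss(left_embs, right_embs, patient_embs, report_embs,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of left-eye, right-eye and patient-level terms."""
    terms = (
        clip_loss(left_embs, report_embs),
        clip_loss(right_embs, report_embs),
        clip_loss(patient_embs, report_embs),
    )
    return sum(w * t for w, t in zip(weights, terms))
```

In this sketch, a batch whose image embeddings align with their own reports yields a loss near zero, while a batch with reports shuffled across patients yields a large loss. Whether the three levels share one text embedding or use separate ones is a design choice the abstract does not specify.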

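This paper develops RET-CLIP, a CLIP-style retinal image foundation model trained specifically on a dataset of 193,865 patients. It outperforms existing benchmarks in four key diagnostic categories: diabetic retinopathy, glaucoma, multiple disease diagnosis, and multi-label classification of multiple diseases.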

RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports

Class Activation Map (CAM) has emerged as a popular tool for weakly
supervised semantic segmentation (WSSS), allowing the localization of object
regions in an image using only image-level labels. However, existing CAM
methods suffer from under-activation of target object regions and
false activation of background regions, because the lack of detailed
supervision can hinder the model's ability to understand the image as a whole.
In this paper, we propose a novel Question-Answer Cross-Language-Image Matching
framework for WSSS (QA-CLIMS), leveraging the vision-language foundation model
to maximize the text-based understanding of images and guide the generation of
activation maps. First, a series of carefully designed questions is posed to
a Visual Question Answering (VQA) model using Question-Answer Prompt
Engineering (QAPE) to generate a corpus describing both foreground target
objects and backgrounds, adapted to each query image. We then employ contrastive
learning in a Region Image Text Contrastive (RITC) network to compare the
obtained foreground and background regions with the generated corpus. Our
approach exploits the rich textual information from the open vocabulary as
additional supervision, enabling the model to generate high-quality CAMs with
more complete object regions and reduced false activation of background
regions. We conduct extensive analysis to validate the proposed method and show
that our approach achieves state-of-the-art performance on both the PASCAL VOC
2012 and MS COCO datasets. Code is available at: this https URL
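
For readers unfamiliar with CAMs, the classic formulation obtains the map for a class by weighting the final convolutional feature maps with that class's classifier weights and summing over channels. The sketch below illustrates only this standard computation on plain Python lists and is not the QA-CLIMS pipeline: the function name and toy shapes are hypothetical, and clipping negative values before dividing by the peak is a common convention assumed here for illustration.

```python
def class_activation_map(feature_maps, class_weights):
    """Compute a normalised CAM for one class.

    feature_maps: list of K feature maps, each an H x W list of lists.
    class_weights: list of K classifier weights for the target class.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Weighted sum over channels at every spatial location.
    cam = [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]
    # Assumed convention: keep only positive evidence, then scale to [0, 1].
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

Thresholding such a map is what yields the pseudo segmentation masks used in WSSS. Under-activation and false activation then show up as object pixels below the threshold and background pixels above it, respectively.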

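We propose a Question-Answer Cross-Language-Image Matching framework that leverages a vision-language foundation model to maximize text-based understanding of images and guide the generation of activation maps, addressing the under-activation of target object regions and the false activation of background regions in existing class activation map methods.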