Vision-language supervision has made remarkable strides in learning visual representations from textual guidance. In digital pathology, vision-language models (VLMs) pre-trained on curated datasets of histological image-caption pairs have been adapted to downstream tasks such as region-of-interest classification. Zero-shot transfer for slide-level prediction has been formulated by MI-Zero, but it exhibits high variability depending on the textual prompts. Inspired by prototypical learning, we propose MI-VisionShot, a training-free adaptation method on top of VLMs to predict slide-level labels in few-shot learning scenarios. Our framework takes advantage of the excellent representation learning of VLMs to create prototype-based classifiers under a multiple-instance setting by retrieving the most discriminative patches within each slide. Experiments across different settings show that MI-VisionShot surpasses zero-shot transfer with lower variability, even in low-shot scenarios. Code coming soon at https://github.com/cvblab/MIVisionShot.
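
The prototype-based, multiple-instance idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the most discriminative patches are chosen, so the sketch assumes they are the top-K patches by cosine similarity to a per-class query vector (for example, a VLM text embedding of the class name). The function names, the `top_k` parameter, and the use of mean pooling are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def slide_embedding(patches, query, top_k):
    """Pool a slide into one vector by averaging the top_k patches most
    similar to `query`. ASSUMPTION: this selection rule is illustrative only."""
    patches = l2_normalize(patches)
    sims = patches @ l2_normalize(query)      # one similarity per patch
    k = min(top_k, len(patches))
    top = np.argsort(sims)[-k:]               # indices of the k highest similarities
    return l2_normalize(patches[top].mean(axis=0))

def build_prototypes(support_slides, support_labels, class_queries, top_k):
    """Build one prototype per class from the few labelled (support) slides.
    Assumes every class has at least one support slide. No training is involved."""
    prototypes = []
    for c, query in enumerate(class_queries):
        embs = [slide_embedding(p, query, top_k)
                for p, y in zip(support_slides, support_labels) if y == c]
        prototypes.append(l2_normalize(np.mean(embs, axis=0)))
    return np.stack(prototypes)

def predict(patches, prototypes, top_k):
    """Score a slide against each prototype, pooling the patches closest to
    that prototype, and return the index of the best-scoring class."""
    scores = [slide_embedding(patches, proto, top_k) @ proto
              for proto in prototypes]
    return int(np.argmax(scores))
```

Here each slide is assumed to be an array of shape `(n_patches, dim)` holding patch embeddings from a frozen VLM image encoder, and `class_queries` is assumed to hold one vector per class. Because the classifier is built only from averages of frozen embeddings, adapting to a new task requires no gradient updates, which is the sense of "training-free" illustrated here.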

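This study addresses the high variability of zero-shot transfer when vision-language models (VLMs) are used for slide-level classification in histopathology. The proposed MI-VisionShot method draws on prototypical learning and adapts VLMs without training to predict slide-level labels in few-shot learning scenarios. Experimental results show that it outperforms conventional zero-shot transfer with lower variability, suggesting potential value for clinical application.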

MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images