Recent advances in medical vision-language pre-training (MedVLP) have significantly improved zero-shot medical vision tasks such as image classification by pre-training on large-scale medical image-text pairs. However, performance on these tasks can depend heavily on how the textual prompts describe the categories, so MedVLP models need to be robust to diverse prompt styles. This sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely used MedVLP methods to a variety of prompts across 15 different diseases. To do so, we designed six prompt styles that mirror real clinical scenarios and ranked them by interpretability. Our findings indicate that all evaluated MedVLP models perform unstably across prompt styles, suggesting a lack of robustness. Additionally, the models' performance varied as prompt interpretability increased, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development of MedVLP methods to improve their robustness to diverse zero-shot prompts.

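This work addresses the unstable performance of existing medical vision-language pre-training models when they are given different textual prompts. We systematically evaluate the prompt sensitivity of three popular MedVLP methods across 15 diseases and find that all models perform unevenly under prompts of differing interpretability, revealing difficulty in understanding complex medical concepts. This indicates that MedVLP methods need further improvement to strengthen their robustness to diverse zero-shot prompts.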

How does the interpretability of diverse textual prompts affect medical vision-language zero-shot tasks?
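
As a rough illustration of how prompt sensitivity in zero-shot classification could be quantified, the sketch below assigns each image to the class whose prompt embedding is most similar, then compares the results across prompt styles. It is not the evaluation protocol used in this work. It assumes image and prompt embeddings have already been produced by some MedVLP encoder, it uses accuracy as a stand-in metric, and the function names `zero_shot_predict` and `prompt_sensitivity` are hypothetical.

```python
import numpy as np


def zero_shot_predict(image_embs, prompt_embs):
    """Assign each image to the class whose prompt embedding is most similar.

    image_embs:  (n_images, dim) array of image embeddings.
    prompt_embs: (n_classes, dim) array, one text embedding per class.
    """
    # L2-normalise so the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)


def prompt_sensitivity(image_embs, labels, prompts_by_style):
    """Return per-style accuracy plus two spread measures across styles.

    prompts_by_style maps a style name to a (n_classes, dim) array.
    Accuracy is an assumed metric chosen only for illustration.
    """
    acc = {
        style: float((zero_shot_predict(image_embs, embs) == labels).mean())
        for style, embs in prompts_by_style.items()
    }
    values = np.array(list(acc.values()))
    spread = float(values.max() - values.min())  # best-vs-worst gap
    std = float(values.std())                    # population std across styles
    return acc, spread, std


# Toy data: two images, two classes, two hypothetical prompt styles.
image_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
prompts_by_style = {
    "style_a": np.array([[1.0, 0.0], [0.0, 1.0]]),  # aligned with the images
    "style_b": np.array([[0.0, 1.0], [1.0, 0.0]]),  # classes swapped
}
acc, spread, std = prompt_sensitivity(image_embs, labels, prompts_by_style)
print(acc, spread, std)
```

A large gap or standard deviation across styles would indicate that the model's predictions depend on how the category is phrased rather than on the image content alone.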