Multimodal deep learning that combines medical images with diagnostic reports has made substantial progress in medical imaging diagnostics, and it is particularly effective for assisted diagnosis when annotations are scarce. Nonetheless, accurately localizing diseases without detailed positional annotations remains a challenge. Although existing methods use local information to achieve fine-grained semantic alignment, they are limited in their ability to extract fine-grained semantics from the full context of a report. To address this problem, we introduce a method that takes full sentences from textual reports as the basic units for local semantic alignment. Our approach pairs chest X-ray images with their corresponding textual reports and performs contrastive learning at both the global and local levels. The leading results our method achieves on multiple datasets confirm its efficacy for lesion localization.
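To make the idea of global and sentence-level contrastive learning concrete, the following is a minimal NumPy sketch, not the implementation evaluated in this work. It assumes that image and report encoders already produce embeddings; the function names (`global_loss`, `local_loss`, `sentence_attended_regions`), the attention pooling of image regions per sentence, the symmetric InfoNCE form, and the temperature of 0.1 are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE on a square similarity matrix.
    Matched pairs are assumed to lie on the diagonal."""
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    loss_rows = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_cols = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_rows + loss_cols)

def global_loss(img_global, txt_global, temperature=0.1):
    """Global level: one embedding per image and per report, shape (B, D)."""
    sim = l2_normalize(img_global) @ l2_normalize(txt_global).T
    return info_nce(sim, temperature)

def sentence_attended_regions(regions, sentences, temperature=0.1):
    """Local level (assumed design): each sentence attends over image regions.
    regions: (R, D), sentences: (S, D) -> attended (S, D), attn (S, R)."""
    regions = l2_normalize(regions)
    sentences = l2_normalize(sentences)
    scores = sentences @ regions.T / temperature
    attn = np.exp(log_softmax(scores, axis=1))  # each row sums to 1
    return attn @ regions, attn

def local_loss(regions, sentences, temperature=0.1):
    """Contrast each sentence with its own attended region vector
    against the attended vectors of the other sentences in the report."""
    attended, _ = sentence_attended_regions(regions, sentences, temperature)
    sim = l2_normalize(sentences) @ l2_normalize(attended).T
    return info_nce(sim, temperature)

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
img_g = rng.normal(size=(4, 16))    # 4 images, global embeddings
txt_g = rng.normal(size=(4, 16))    # 4 reports, global embeddings
regions = rng.normal(size=(49, 16)) # e.g. a 7x7 grid of region features
sents = rng.normal(size=(5, 16))    # 5 sentence embeddings from one report
total = global_loss(img_g, txt_g) + local_loss(regions, sents)
```

In this sketch, the attention map returned by `sentence_attended_regions` is what would be inspected to see which image regions a sentence points to; whether the actual method derives localization this way is an assumption.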

Multimodal Self-Supervised Learning for Lesion Localization