This paper presents a comprehensive evaluation of GPT-4V's capabilities across diverse medical imaging tasks, including Radiology Report Generation, Medical Visual Question Answering (VQA), and Visual Grounding. While prior efforts have explored GPT-4V's performance in medical imaging, to the best of our knowledge, our study represents the first quantitative evaluation on publicly available benchmarks. Our findings highlight GPT-4V's potential in generating descriptive reports for chest X-ray images, particularly when guided by well-structured prompts. However, its performance on the MIMIC-CXR dataset benchmark reveals areas for improvement in certain evaluation metrics, such as CIDEr. In the domain of Medical VQA, GPT-4V demonstrates proficiency in distinguishing between question types but falls short of prevailing benchmarks in terms of accuracy. Furthermore, our analysis finds the limitations of conventional evaluation metrics like the BLEU score, advocating for the development of more semantically robust assessment methods. In the field of Visual Grounding, GPT-4V exhibits preliminary promise in recognizing bounding boxes, but its precision is lacking, especially in identifying specific medical organs and signs. Our evaluation underscores the significant potential of GPT-4V in the medical imaging domain, while also emphasizing the need for targeted refinements to fully unlock its capabilities.

这篇论文全面评估了GPT-4V在不同的医学图像任务中的能力，包括放射学报告生成、医学视觉问答和视觉基础。我们的研究首次对公开可用的基准进行了定量评估，发现了GPT-4V在为胸部X射线图像生成描述性报告方面的潜力，特别是在有良好结构提示的引导下。然而，我们的发现也揭示了GPT-4V在某些评估指标（如CIDEr）上仍需改进，尤其是在MIMIC-CXR数据集基准上。在医学问答方面，虽然GPT-4V在区分问题类型方面表现出了熟练度，但在准确性方面还不及现有基准。此外，我们的分析发现了常规评估指标（如BLEU分数）的局限性，倡导发展更语义鲁棒的评估方法。在视觉基础领域，虽然GPT-4V在识别边界框方面显示了初步的潜力，但其精度不够，特别是在识别特定的医学器官和病症方面。我们的评估强调了GPT-4V在医学图像领域的重要潜力，同时也强调了需要针对性的改进来充分发挥其能力。

GPT-4V在医学影像中的多模态能力综合研究