GPT-4V's purported strong multimodal abilities have raised interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it performs very poorly on both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (ground-truth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which ground-truth conditions are present in the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given ground-truth conditions in report synthesis, its generated reports are less correct and less natural-sounding than those of a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.
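
The observation about predicted-label distributions can be illustrated with a small check. The sketch below is not the authors' evaluation code. It assumes each study has a set of ground-truth condition labels and a set of model-predicted labels, and it compares how often each label is predicted when a given condition is present versus absent. The function names (`prediction_rates`, `max_rate_gap`), the label names, and the toy data are all hypothetical and chosen only for illustration.

```python
def prediction_rates(predicted, labels):
    """Fraction of studies in which each label is predicted."""
    n = len(predicted)
    if n == 0:
        return {label: 0.0 for label in labels}
    return {label: sum(label in p for p in predicted) / n for label in labels}


def max_rate_gap(truth, predicted, condition, labels):
    """Largest change in any label's prediction rate between studies
    where `condition` is truly present and studies where it is absent.
    A gap near 0 means the predictions ignore the condition."""
    present = [p for t, p in zip(truth, predicted) if condition in t]
    absent = [p for t, p in zip(truth, predicted) if condition not in t]
    rates_present = prediction_rates(present, labels)
    rates_absent = prediction_rates(absent, labels)
    return max(abs(rates_present[l] - rates_absent[l]) for l in labels)


# Toy data (hypothetical): four studies, two condition labels.
LABELS = ["Cardiomegaly", "Edema"]
truth = [{"Cardiomegaly"}, {"Cardiomegaly"}, {"Edema"}, set()]

# A model whose outputs do not depend on the image content.
image_blind = [{"Edema"}, set(), {"Edema"}, set()]

gap_blind = max_rate_gap(truth, image_blind, "Cardiomegaly", LABELS)
gap_oracle = max_rate_gap(truth, truth, "Cardiomegaly", LABELS)
```

With this toy data, the image-blind predictions yield a gap of 0, while predictions that match the ground truth yield a large gap. A gap that stays near zero for every condition would be consistent with a model that is not using the image.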

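Through a systematic evaluation of GPT-4V's report generation on two chest X-ray report datasets, we find that it performs very poorly on both lexical metrics and clinical efficacy metrics. We decompose the task into two steps, medical image reasoning and report synthesis from (ground-truth) conditions, and show that GPT-4V's image reasoning performance is consistently low and that, even when given ground-truth conditions for report synthesis, its generated reports are less correct and less natural-sounding than those of a finetuned LLaMA-2. Altogether, we question the viability of using GPT-4V in a radiology workflow.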

GPT-4V Still Cannot Generate Radiology Reports