Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework that uses large language models (LLMs) to compare radiology reports. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. To reduce cost and improve accessibility, and thereby make the method practical, we construct a dataset from LLM evaluation results and use knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be open-sourced.

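Summary: This work presents a novel framework that uses large language models to evaluate medical imaging reports. Measured against radiologists' assessments, the proposed metric, when run with GPT-4, reaches evaluation consistency close to that of radiologists. To reduce cost and improve accessibility, the authors build a dataset from the language model's evaluation results and apply knowledge distillation to train a smaller model whose evaluation capability is comparable to GPT-4. The result is an easy-to-use and efficient evaluation method for medical imaging report generation that supports the development of more clinically relevant models. The model will be open-sourced.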

LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation