Large language models (LLMs) like ChatGPT show excellent capabilities in
various natural language processing tasks, especially for text generation. The
effectiveness of LLMs in summarizing radiology report impressions remains
unclear. In this study, we explore the capability of eight LLMs on
radiology report impression summarization. Three types of radiology reports,
i.e., CT, PET-CT, and Ultrasound reports, are collected from Peking University
Cancer Hospital and Institute. We use the report findings to construct the
zero-shot, one-shot, and three-shot prompts with complete example reports to
generate the impressions. Besides the automatic quantitative evaluation
metrics, we define five human evaluation metrics, i.e., completeness,
correctness, conciseness, verisimilitude, and replaceability, to evaluate the
semantics of the generated impressions. Two thoracic surgeons (ZSY and LB) and
one radiologist (LQ) compare the generated impressions with the reference
impressions and score each impression under the five human evaluation metrics.
Experimental results show that there is a gap between the generated impressions
and reference impressions. Although the LLMs achieve comparable performance in
completeness and correctness, the conciseness and verisimilitude scores are not
very high. Using few-shot prompts can improve the LLMs' performance in
conciseness and verisimilitude, but the clinicians still think the LLMs cannot
replace the radiologists in summarizing the radiology impressions.
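
The zero-shot, one-shot, and three-shot setup described above can be illustrated with a minimal sketch. It is not the authors' actual prompt: the instruction wording, the `Report` type, and the `build_prompt` function are hypothetical names introduced here, and the only assumption taken from the abstract is that each shot is a complete example report (findings plus impression) placed before the query findings.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Report:
    """A complete example report (hypothetical container, not from the paper)."""
    findings: str
    impression: str


def build_prompt(findings: str, examples: Sequence[Report], n_shots: int) -> str:
    """Assemble an n-shot prompt: instruction, n worked examples, then the query."""
    if not 0 <= n_shots <= len(examples):
        raise ValueError("n_shots must be between 0 and the number of examples")
    # Assumed instruction text; the paper's real wording is not given in the abstract.
    parts = ["Summarize the findings of the radiology report into an impression."]
    for ex in examples[:n_shots]:
        parts.append(f"Findings: {ex.findings}\nImpression: {ex.impression}")
    # The query report ends with an open "Impression:" slot for the model to fill.
    parts.append(f"Findings: {findings}\nImpression:")
    return "\n\n".join(parts)
```

Calling `build_prompt(findings, examples, 0)`, `build_prompt(findings, examples, 1)`, and `build_prompt(findings, examples, 3)` would yield the three prompt types for the same report, so only the number of worked examples differs between conditions.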

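This study examines the ability of eight large language models to summarize radiology report impressions. It builds zero-shot, one-shot, and three-shot prompts from CT, PET-CT, and ultrasound reports and defines five human evaluation metrics to assess the semantics of the generated impressions. The results show that the LLMs perform well on completeness and correctness but do not score highly on conciseness and verisimilitude. Few-shot prompts improve conciseness and verisimilitude, but the clinicians still consider that LLMs cannot replace radiologists in summarizing impressions.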

The current status of large language models in summarizing radiology report impressions

In this paper, we explore the application of large language models (LLMs) for
generating code-tracing questions in introductory programming courses. We
designed targeted prompts for GPT4, guiding it to generate code-tracing
questions based on code snippets and descriptions. We established a set of
human evaluation metrics to assess the quality of questions produced by the
model compared to those created by human experts. Our analysis provides
insights into the capabilities and potential of LLMs in generating diverse
code-tracing questions. Additionally, we present a unique dataset of human and
LLM-generated tracing questions, serving as a valuable resource for both the
education and NLP research communities. This work contributes to the ongoing
dialogue on the potential uses of LLMs in educational settings.
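
As a rough illustration of what a targeted prompt built from a code snippet and its description might look like, here is a minimal sketch. The wording of the prompt and the `tracing_prompt` function are hypothetical and are not taken from the paper; the sketch only assumes, as the abstract states, that the model receives both the snippet and a description.

```python
def tracing_prompt(snippet: str, description: str) -> str:
    """Build a prompt asking a model for one code-tracing question."""
    # The instruction text below is an assumption made for illustration only.
    return (
        "You are writing exercises for an introductory programming course.\n"
        f"Description of the code: {description.strip()}\n"
        "Code:\n"
        f"{snippet.strip()}\n"
        "Write one code-tracing question that asks the student to predict "
        "what this code outputs or what value a variable holds, "
        "without running it."
    )
```

The returned string would then be sent to the model of choice; the sketch deliberately stops before any API call, since the abstract does not describe how the model was queried.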

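We explore the use of large language models (LLMs) to generate code-tracing questions in introductory programming courses. We design targeted prompts that guide GPT4 to generate code-tracing questions from code snippets and descriptions, and we establish a set of human evaluation metrics to compare the quality of model-generated questions with that of questions written by human experts. Our analysis reveals the capabilities and potential of LLMs in generating diverse code-tracing questions, and we provide a unique dataset of human- and LLM-generated tracing questions as a resource for the education and NLP research communities. This work contributes to the ongoing discussion of the potential uses of LLMs in educational settings.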