LLM-based automated program repair (APR) methods have attracted significant attention for their state-of-the-art performance. However, they have been evaluated primarily on a few well-known datasets such as Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on DEFECTS4J-TRANS, a new dataset derived by transforming Defects4J while preserving the original semantics. Experiments on both Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited generalizability in APR tasks: on DEFECTS4J-TRANS, the average number of correct and plausible patches decreases by 49.48% and 42.90%, respectively. Further investigation into incorporating additional repair-relevant information in repair prompts reveals that, although this information significantly enhances the LLMs' capabilities (increasing the number of correct and plausible patches by up to 136.67% and 121.82%, respectively), performance still falls short of the original Defects4J results. This indicates that prompt engineering alone is insufficient to substantially enhance LLMs' repair capabilities. Based on our study, we also offer several recommendations for future research.

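This study evaluates the generalizability of large language models (LLMs) in automated program repair (APR) and finds that their performance drops markedly on the new dataset DEFECTS4J-TRANS, with the number of correct and plausible patches decreasing by 49.48% and 42.90%, respectively. Although adding repair-relevant information improves the models' capabilities, overall performance still falls short of the original results, indicating that prompt engineering alone cannot substantially improve LLMs' repair capabilities.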

Evaluating the Generalizability of Large Language Models in Automated Program Repair