Recently, the large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code. It has been a common practice of software engineers to consult LLMs when encountering coding questions. Although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generationfrom LLMs have not yet been thoroughly studied. The executable code is not equivalent to the reliable and robust code, especially in the context of real-world software development.The misuse of APIs in the generated code could lead to severe problem, such as resource leaks, program crashes, etc.To make things worse, the users of LLM code generation services are actually the developers that are most vulnerable to these code that seems right -- They are always novice developers that are not familiar with the APIs that LLMs generate code for them. Therefore, they could hardly tell the misuse in the code generated by LLMs, which further facilitates the incorrect code applied in real-world software. Existing code evaluation benchmark and datasets focus on crafting small tasks such as programming questions in coding interviews, which however deviates from the problem that developers would ask LLM for real-world coding help. To fill the missing piece, in this work, we propose a dataset RobustAPI for evaluating the reliability and robustness of code generated by LLMs. We collect 1208 coding questions from StackOverflow on 24 representative Java APIs. We summarize thecommon misuse patterns of these APIs and evaluate them oncurrent popular LLMs. The evaluation results show that evenfor GPT-4, 62% of the generated code contains API misuses,which would cause unexpected consequences if the code isintroduced into real-world software.

最近，大型语言模型 (LLMs) 在理解自然语言和生成编程代码方面表现出了非凡的能力。然而，对于LLMs生成的代码的可靠性和鲁棒性的研究尚未得到深入的探讨。这项研究提出了一个包括1208个编程问题的数据集RobustAPI，用于评估LLMs生成的代码的可靠性和鲁棒性，并发现甚至对于GPT-4而言，62%的生成代码存在API误用，这可能导致意想不到的后果。

大型语言模型代码生成的鲁棒性和可靠性研究