Large Language Models (LLMs) have made significant advances in code
generation, offering unprecedented support for automated programming and for
developers. However, LLMs sometimes generate code that appears
plausible but fails to meet the expected requirements or executes incorrectly.
This phenomenon, hallucination in code generation, has not been explored. To
advance the community's understanding of code hallucinations in LLMs, we
introduce the concept of code hallucinations for the first time and propose a
method for defining them based on execution verification. We categorize code
hallucinations into four main types: mapping,
naming, resource, and logic hallucinations, each further divided into different
subcategories to better understand and address the unique challenges faced by
LLMs during code generation. To evaluate code hallucinations systematically, we
propose a dynamic detection algorithm and construct the CodeHalu benchmark,
which includes 8,883 samples from 699 tasks, to actively detect hallucinations
in LLMs during programming. We tested 16 popular
LLMs on this benchmark to evaluate the frequency and nature of their
hallucinations during code generation. The findings reveal significant
variations in the accuracy and reliability of LLMs in generating code,
highlighting the urgent need to improve models and training methods to ensure
the functional correctness and safety of automatically generated code. This
study not only classifies and quantifies code hallucinations but also provides
insights for future improvements in LLM-based code generation research. The
CodeHalu benchmark and code are publicly available at
this https URL
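
The abstract defines code hallucinations through execution verification but
does not spell out the detection rules. The following is a minimal sketch of
the general idea only, not the CodeHalu algorithm: it runs a generated function
on one test case and labels the outcome with one of the four categories. The
mapping from Python exception types to categories, the function name
`classify_run`, and the labels "ok" and "other" are assumptions made for
illustration.

```python
from typing import Any

# Hypothetical mapping from exception types to the four categories named in
# the abstract; the real rules are not given in this text.
# The first matching entry wins.
CATEGORY_BY_EXCEPTION = [
    ((NameError, AttributeError, ImportError), "naming"),
    ((TypeError, ValueError, IndexError, KeyError), "mapping"),
    ((MemoryError, RecursionError, TimeoutError), "resource"),
]


def classify_run(source: str, entry_point: str, args: tuple, expected: Any) -> str:
    """Execute generated code on one test case and label the outcome."""
    namespace: dict = {}
    try:
        # Untrusted code should only ever be executed inside a sandbox.
        exec(source, namespace)
        func = namespace.get(entry_point)
        if func is None:
            # The requested function was never defined (assumed: naming).
            return "naming"
        result = func(*args)
    except Exception as exc:
        for exc_types, category in CATEGORY_BY_EXCEPTION:
            if isinstance(exc, exc_types):
                return category
        return "other"
    # The code ran to completion: a wrong answer counts as a logic error.
    return "ok" if result == expected else "logic"


if __name__ == "__main__":
    buggy = "def add(a, b):\n    return a + c\n"
    print(classify_run(buggy, "add", (2, 3), 5))
```

A real harness would also need process isolation and wall-clock timeouts,
which this sketch leaves out.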

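Large language models have made significant progress in code generation,
providing unprecedented support for automated programming and for developers.
However, they sometimes generate code that looks plausible but fails to meet
the expected requirements or executes incorrectly. This study proposes an
execution-verification-based method for defining code hallucinations and
introduces the concept of code hallucinations for the first time, dividing them
into four main types (mapping, naming, resource, and logic) to better
understand and address the unique challenges LLMs face during code generation.
We propose a dynamic detection algorithm and construct the CodeHalu benchmark,
which includes 8,883 samples from 699 tasks, to actively detect hallucinations
in LLMs during programming. We tested 16 popular LLMs on this benchmark and
evaluated the frequency and nature of their hallucinations during code
generation. The results reveal significant differences in the accuracy and
reliability of LLMs in generating code, underscoring the urgent need to improve
models and training methods to ensure the functional correctness and safety of
automatically generated code. This study not only classifies and quantifies
code hallucinations but also offers insights for improving LLM-based code
generation research. The CodeHalu benchmark and code are publicly available at
this https URL.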

CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification

Advancing automated programming necessitates robust and comprehensive code
generation benchmarks, yet current evaluation frameworks largely neglect
object-oriented programming (OOP) in favor of functional programming (FP),
e.g., HumanEval and MBPP. To address this, our study introduces a pioneering
OOP-focused benchmark, featuring 431 Python programs that encompass essential
OOP concepts and features like classes and encapsulation methods. We propose a
novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k
measures. Our evaluation of 23 leading large language models (LLMs), including
both general and code-specialized models, reveals three key insights: 1) pass@o
offers a more relevant and comprehensive assessment for OOP code generation; 2)
despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP
compared to models like ChatGPT; 3) the poor performance of all advanced LLMs
on our OOP benchmark highlights a critical need for improvements in this field.
Our benchmark and scripts are publicly released at:
this https URL
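
For context on the baseline metric mentioned above, the sketch below computes
the standard unbiased pass@k estimate widely used with HumanEval-style
benchmarks: with n samples generated per task, of which c pass all tests,
pass@k = 1 - C(n - c, k) / C(n, k). The definition of pass@o is not given in
this abstract, so it is not reproduced here; the function name `pass_at_k` is
chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for the task
    c: number of those samples that pass all tests
    k: sample budget being evaluated
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples for one task, 3 of them correct, evaluated at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A benchmark-level score is usually the mean of this per-task estimate over all
tasks.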

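Advancing automated programming requires robust and comprehensive code
generation benchmarks, yet current evaluation frameworks largely neglect
object-oriented programming (OOP) in favor of functional programming (FP). This
study introduces a pioneering OOP-focused benchmark comprising 431 Python
programs that cover key OOP concepts and features, and proposes a new
OOP-specific evaluation metric, pass@o, which improves on the traditional
pass@k measure. The results show that pass@o provides a more relevant and
comprehensive assessment of OOP code generation; that code-specialized language
models excel at FP but trail models such as ChatGPT on OOP; and that the poor
performance of all advanced language models on the OOP benchmark highlights the
need for improvement in this area.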