Evaluation of large language models (LLMs) for code has primarily relied on
static benchmarks, including HumanEval (Chen et al., 2021), which measure the
ability of LLMs to generate complete code that passes unit tests. As LLMs are
increasingly used as programmer assistants, we study whether gains on existing
benchmarks translate to gains in programmer productivity when coding with LLMs,
including time spent coding. In addition to static benchmarks, we investigate
the utility of preference metrics that might be used as proxies to measure LLM
helpfulness, such as code acceptance or copy rates. To do so, we introduce
RealHumanEval, a web interface to measure the ability of LLMs to assist
programmers, through either autocomplete or chat support. We conducted a user
study (N=213) using RealHumanEval in which users interacted with six LLMs of
varying base model performance. Although static benchmarks do not incorporate
humans in the loop, we find that improvements in benchmark performance lead to
increased programmer productivity; however, gaps in benchmark performance are
not proportional to gaps in human performance, a trend that holds across both
forms of LLM support. In contrast, we find that programmer preferences do not correlate with
their actual performance, motivating the need for better, human-centric proxy
signals. We also open-source RealHumanEval to enable human-centric evaluation
of new models and the study data to facilitate efforts to improve code models.
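
As a point of reference for what a static benchmark score looks like, the following is a minimal sketch of the pass@k estimator commonly reported for HumanEval (Chen et al., 2021). The abstract above does not say which static metric was used in the study, so the choice of pass@k here is an assumption, and the function name and argument names are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass all unit tests
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples
    # contains no passing completion, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n, c) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A score of this kind is computed without any human in the loop, which is why the study compares it against measured programmer productivity rather than treating it as a direct proxy.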

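Using RealHumanEval, static benchmarks, and preference metrics, this work studies how effective large language models (LLMs) are at coding support and how they affect programmer productivity. It finds that better benchmark performance increases programmer productivity, but that gaps in benchmark performance are not proportional to gaps in human performance, and that programmer preferences do not correlate with actual performance, which motivates better, human-centric evaluation signals. The RealHumanEval tool and the study data are open-sourced to support improvements to code models.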

RealHumanEval: Evaluating the Ability of Large Language Models to Support Programmers

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Code Large Language Models (Code LLMs) have demonstrated outstanding
performance in code-related tasks. Several instruction tuning approaches have
been proposed to boost the code generation performance of pre-trained Code
LLMs. In this paper, we introduce a diverse instruction model (DolphCoder) with
self-evaluation for code generation. It learns diverse instruction targets and
combines them with a code evaluation objective to enhance its code generation ability.
Our model achieves superior performance on the HumanEval and MBPP benchmarks,
demonstrating new insights for future code instruction tuning work. Our key
findings are: (1) Augmenting more diverse responses with distinct reasoning
paths increases the code capability of LLMs. (2) Improving a model's ability to
evaluate the correctness of code solutions also enhances its ability to
create them.

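DolphCoder, a diverse instruction model with self-evaluation, is introduced to boost the code generation performance of pre-trained Code LLMs. It achieves superior performance on the HumanEval and MBPP benchmarks and offers new insights for future code instruction tuning work.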

DolphCoder: Echo-Locating Code Large Language Models with Multi-Objective Instruction Tuning

DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning

Generating code from natural language using Large Language Models (LLMs)
such as ChatGPT seems groundbreaking. Yet, with more extensive use, it is
evident that this approach has its own limitations. The inherent ambiguity of
natural language presents challenges for complex software designs. Accordingly,
our research offers an Agile Model-Driven Development (MDD) approach that
enhances code auto-generation using OpenAI's GPT-4. Our work emphasizes
"Agility" as a significant contribution to the current MDD method, particularly
when the model undergoes changes or needs deployment in a different programming
language. Thus, we present a case-study showcasing a multi-agent simulation
system of an Unmanned Vehicle Fleet. In the first and second layers of our
approach, we constructed a textual representation of the case-study using
Unified Modeling Language (UML) diagrams. In the next layer, we introduced two
sets of constraints that minimize model ambiguity. Object Constraint Language
(OCL) is applied to fine-tune the code construction details, while a FIPA
ontology is used to shape communication semantics and protocols. Ultimately,
leveraging GPT-4, our last layer auto-generates code in both Java and Python.
The Java code is deployed within the JADE framework, while the Python code is
deployed in the PADE framework. Concluding our research, we engaged in a
comprehensive evaluation of the generated code. From a behavioural standpoint,
the auto-generated code aligned perfectly with the expected UML sequence
diagram. Structurally, we compared the complexity of code derived from UML
diagrams constrained solely by OCL to that influenced by both OCL and
FIPA-ontology. Results indicate that ontology-constrained models produce
inherently more intricate code, but that the code remains manageable and
low-risk for further testing and maintenance.
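
The layered approach described above can be pictured as assembling one prompt from the textual UML model, the OCL constraints, and the FIPA ontology, then changing only the target language. The sketch below is a hypothetical illustration of that idea and not the authors' implementation: the names ModelSpec and build_prompt, the prompt wording, and the sample constraint are all assumptions, and no call to GPT-4 is made.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSpec:
    """Hypothetical container for the layers named in the abstract."""
    uml_text: str                                             # layers 1-2: textual UML
    ocl_constraints: list[str] = field(default_factory=list)  # layer 3: OCL
    fipa_ontology: list[str] = field(default_factory=list)    # layer 3: FIPA ontology


def build_prompt(spec: ModelSpec, target_language: str) -> str:
    """Join the layers into a single prompt for a code-generating LLM."""
    sections = [
        f"Generate {target_language} code for the following model.",
        "UML model:\n" + spec.uml_text.strip(),
    ]
    # Constraint sections are included only when the layer is present,
    # so an OCL-only model and an OCL+FIPA model can be compared.
    if spec.ocl_constraints:
        sections.append("OCL constraints:\n" + "\n".join(spec.ocl_constraints))
    if spec.fipa_ontology:
        sections.append("FIPA ontology:\n" + "\n".join(spec.fipa_ontology))
    return "\n\n".join(sections)


# Illustrative values only; they are not taken from the case-study.
spec = ModelSpec(
    uml_text="class Vehicle { speed: Real }",
    ocl_constraints=["context Vehicle inv: self.speed >= 0"],
    fipa_ontology=["request: ask a vehicle to accept a task"],
)
java_prompt = build_prompt(spec, "Java")
python_prompt = build_prompt(spec, "Python")
```

Because the model layers are held separately from the target language, regenerating code after a model change or for a different language only requires rebuilding the prompt, which is one way to read the "Agility" claim.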

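Generating code from natural language with large language models (LLMs) such as ChatGPT seems groundbreaking, yet with wider use it is evident that the approach has its own limitations. This research proposes an Agile Model-Driven Development (MDD) approach that uses OpenAI's GPT-4 to enhance code auto-generation. The work emphasizes "Agility" as a significant contribution to the current MDD method, particularly when the model changes or needs to be deployed in a different programming language. A case study presents a multi-agent simulation system of an Unmanned Vehicle Fleet. In the first and second layers of the approach, a textual representation of the case study is built from Unified Modeling Language (UML) diagrams. The next layer introduces two sets of constraints that minimize model ambiguity: Object Constraint Language (OCL) fine-tunes the code construction details, while a FIPA ontology shapes communication semantics and protocols. Finally, the last layer uses GPT-4 to auto-generate code in both Java and Python; the Java code is deployed in the JADE framework and the Python code in the PADE framework. The research concludes with a comprehensive evaluation of the generated code. Behaviourally, the auto-generated code matched the expected UML sequence diagram exactly. Structurally, the complexity of code derived from UML diagrams constrained only by OCL is compared with that of code constrained by both OCL and the FIPA ontology. The results indicate that ontology-constrained models produce inherently more complex code, but that the code remains manageable and low-risk for further testing and maintenance.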