Evaluation of large language models (LLMs) for code has primarily relied on
static benchmarks, including HumanEval (Chen et al., 2021), which measure the
ability of LLMs to generate complete code that passes unit tests. As LLMs are
increasingly used as programmer assistants, we study whether gains on existing
benchmarks translate to gains in programmer productivity when coding with LLMs,
including time spent coding. In addition to static benchmarks, we investigate
the utility of preference metrics that might be used as proxies to measure LLM
helpfulness, such as code acceptance or copy rates. To do so, we introduce
RealHumanEval, a web interface to measure the ability of LLMs to assist
programmers, through either autocomplete or chat support. We conducted a user
study (N=213) using RealHumanEval in which users interacted with six LLMs of
varying base model performance. Although static benchmarks do not incorporate
humans in the loop, we find that improvements in benchmark performance lead to
increased programmer productivity; however, gaps in benchmark versus human
performance are not proportional, a trend that holds across both forms of LLM
support. In contrast, we find that programmer preferences do not correlate with
their actual performance, motivating the need for better, human-centric proxy
signals. We also open-source RealHumanEval to enable human-centric evaluation
of new models and the study data to facilitate efforts to improve code models.

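Summary: Using RealHumanEval, static benchmarks, and preference metrics, this
work studies how effective large language models (LLMs) are at coding and how
they affect programmer productivity. It finds that improved benchmark
performance raises programmer productivity, but that gaps in benchmark
performance and gaps in human performance are not proportional. It also finds
that programmer preferences do not correlate with actual performance, which
motivates the need for better, human-centric evaluation metrics. RealHumanEval
and the study data are open-sourced to facilitate the improvement of code
models.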