Large Language Models (LLMs) are gaining popularity among software engineers.
A crucial aspect of developing effective code-generation LLMs is to evaluate
these models using a robust benchmark. Evaluation benchmarks with quality
issues can provide a false sense of performance. In this work, we conduct the
first-of-its-kind study of the quality of prompts within benchmarks used to
compare the performance of different code generation models. To conduct this
study, we analyzed 3,566 prompts from 9 code generation benchmarks to identify
quality issues in them. We also investigated whether fixing the identified
quality issues in the benchmarks' prompts affects a model's performance. We
further studied memorization of the evaluation datasets, which can call a
benchmark's trustworthiness into question. We found that code generation
evaluation benchmarks mainly focused on Python and coding exercises and had
very limited contextual dependencies to challenge the model. These datasets and
the developers' prompts suffer from quality issues such as spelling and
grammatical errors, sentences that express the developers' intent unclearly,
and failure to use a proper documentation style. Fixing all these issues in the
benchmarks can lead to better performance for Python code generation, but no
significant improvement was observed for Java code generation. We also found
evidence that the GPT-3.5-Turbo and CodeGen-2.5 models may have data
contamination issues.

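Evaluating how well large language models generate code requires robust benchmarks, and benchmarks that lack rigor can give a false picture of performance. This study analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues and examined how fixing those issues affects model performance. The benchmarks were found to focus mainly on Python and coding exercises and to lack contextual dependencies; they also suffer from quality issues such as spelling and grammatical errors, unclear wording, and failure to follow proper documentation conventions. Fixing these issues can improve performance on Python code generation, but the improvement for Java code generation is not significant. The study also found that the GPT-3.5-Turbo and CodeGen-2.5 models may have data contamination issues.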

Quality Assessment of Prompts Used in Code Generation

In recent years, the use of automated source code generation utilizing
transformer-based generative models has expanded, and these models can generate
functional code according to the requirements of the developers. However,
recent research has revealed that automatically generated source code can
contain vulnerabilities and other quality issues. Despite researchers' and
practitioners' attempts to enhance code generation models, retraining and
fine-tuning large language models is time-consuming and resource-intensive.
Thus, we describe FRANC, a lightweight framework for recommending more secure
and high-quality source code derived from transformer-based code generation
models. FRANC includes a static filter that uses heuristics to make the
generated code compilable and a quality-aware ranker that sorts the code
snippets based on a
quality score. Moreover, the framework uses prompt engineering to fix
persistent quality issues. We evaluated the framework with five Python and Java
code generation models and six prompt datasets, including a newly created one
in this work (SOEval). The static filter improves the compilability of 9% to
46% of Java suggestions and 10% to 43% of Python suggestions. The ranking
system improves the NDCG@10 score by 0.0763 on average, and the repair
techniques fix up to 80% of prompts. On average, FRANC takes 1.98 seconds for
Java and 0.08 seconds for Python.
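
As a rough illustration of the filter, rank, and evaluate steps described
above, the following Python sketch may help. It is not FRANC's implementation:
the filter here only discards snippets that fail to parse and does not attempt
any heuristic repair, `toy_score` is a hypothetical stand-in for the quality
score, and every function name is an assumption made for this example. The
NDCG@k code follows the standard definition with linear gain; the exact variant
used in the evaluation is not stated in the abstract.

```python
import ast
import math


def is_compilable(snippet: str) -> bool:
    """Assumed static check: a snippet passes if it parses as Python."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        return False


def rank_by_quality(snippets, quality_score):
    """Sort snippets from highest to lowest score.

    quality_score is a hypothetical callable mapping a snippet to a number.
    """
    return sorted(snippets, key=quality_score, reverse=True)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def toy_score(snippet: str) -> float:
    """Toy quality score (an assumption): penalize snippets that call eval."""
    return 0.0 if "eval(" in snippet else 1.0


candidates = [
    "def add(a, b):\n    return eval(f'{a}+{b}')\n",
    "def add(a, b) return a + b",  # syntax error, dropped by the filter
    "def add(a, b):\n    return a + b\n",
]

kept = [s for s in candidates if is_compilable(s)]
ranked = rank_by_quality(kept, toy_score)

# Relevance labels listed in ranked order; a perfect ranking scores 1.0.
perfect = ndcg_at_k([3, 2, 1], k=10)
reversed_order = ndcg_at_k([1, 2, 3], k=10)
```

In this toy run the malformed candidate is dropped, the snippet without `eval`
is ranked first, and the reversed relevance ordering receives a lower NDCG
score than the perfect one.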

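FRANC is a lightweight framework for recommending more secure, higher-quality source code generated by transformer-based code generation models. It comprises a static filter, a quality-aware ranker, and prompt engineering. In an evaluation on five Python and Java code generation models and six prompt datasets, the static filter improved the compilability of up to 46% of Java suggestions and up to 43% of Python suggestions.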

A Lightweight Framework for High-Quality Code Generation

It is often overlooked that AI-enabled systems are also software systems and
therefore rely on software quality assurance (SQA). Thus, the goal of this
study is to investigate the software quality assurance strategies adopted
during the development, integration, and maintenance of AI/ML components and
code. We conducted semi-structured interviews with representatives of ten
Austrian SMEs that develop AI-enabled systems. A qualitative analysis of the
interview data identified 12 issues in the development of AI/ML components.
Furthermore, we identified when quality issues arise in AI/ML components and
how they are detected. The results of this study should guide future work on
software quality assurance processes and techniques for AI/ML components.

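The study investigated the software quality assurance strategies that ten Austrian SMEs adopt when developing AI/ML components and code. It determined when quality issues arise and how they are detected, identified 12 issues in the development of AI/ML components, and offers guidance for future software quality assurance processes for AI/ML components.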