Machine learning models are widely used, but their outputs are often wrong. Users
would benefit from a reliable indication of whether a given output from a given
model should be trusted, so that they can decide rationally whether to use the
output. For example, outputs can be associated with a confidence
measure; if this confidence measure is strongly associated with the likelihood of
correctness, then the model is said to be well-calibrated. In that case,
high-confidence outputs could be safely accepted and low-confidence
outputs rejected.
Calibration has so far been studied in non-generative (e.g., classification)
settings, especially in Software Engineering. However, generated code is quite
often wrong: developers need to know when they should use model-generated code
directly, use it after careful review, or discard it; thus calibration is
vital in generative settings. However, the notion of correctness of generated
code is non-trivial, and thus so is calibration. In this paper we make several
contributions. We develop a framework for evaluating the calibration of
code-generating models. We consider several tasks, correctness criteria,
datasets, and approaches, and find that by and large generative code models are
not well-calibrated out of the box. We then show how calibration can be
improved using standard methods such as Platt scaling. Our contributions will
lead to better-calibrated decision-making in the current use of code generated
by language models, and offer a framework for future research to further
improve calibration methods for generative models in Software Engineering.

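This paper introduces a framework for evaluating the calibration of generative models and finds that most code-generating models are poorly calibrated. Standard methods such as Platt scaling can improve calibration, which supports more accurate decision-making and provides a framework for future research on calibration methods.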
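
The two ideas named in the abstract above, measuring calibration and improving it with Platt scaling, can be illustrated with a minimal sketch. This is not the paper's framework. The function names, the toy confidences, and the use of Expected Calibration Error (ECE) as the metric are assumptions chosen for illustration. The Platt step fits p = sigmoid(a * s + b) to raw confidence scores s by plain gradient descent on the log loss.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean gap between average confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

def fit_platt(scores, correct, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, correct):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # d(log loss)/da for one example
            grad_b += (p - y)      # d(log loss)/db for one example
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(scores, a, b):
    """Map raw scores to rescaled probabilities with fitted parameters."""
    return [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]

# Hypothetical toy data: an overconfident model.
# Five outputs at confidence 0.9 (3 correct), five at 0.6 (1 correct).
confidences = [0.9] * 5 + [0.6] * 5
correct = [1, 1, 1, 0, 0] + [1, 0, 0, 0, 0]

a, b = fit_platt(confidences, correct)
calibrated = apply_platt(confidences, a, b)

ece_before = expected_calibration_error(confidences, correct)
ece_after = expected_calibration_error(calibrated, correct)
print(f"ECE before: {ece_before:.3f}, after Platt scaling: {ece_after:.3f}")
```

For brevity the sketch fits and evaluates on the same ten examples. In practice the scaling parameters would be fitted on a held-out set, and correctness labels for generated code would come from whichever correctness criterion is in use, such as passing tests.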

Quality and Trust in LLM-generated Code

Instruction-tuning has become an integral part of training pipelines for
Large Language Models (LLMs) and has been shown to yield strong performance
gains. In an orthogonal line of research, Annotation Error Detection (AED) has
emerged as a tool for detecting quality issues in gold-standard labels. So
far, however, the application of AED methods has been limited to discriminative
settings. It is an open question how well AED methods generalize to generative
settings, which are becoming widespread via generative LLMs. In this work, we
present Donkii, a first novel benchmark for AED on instruction-tuning data. It
encompasses three instruction-tuning datasets enriched with annotations by
experts and semi-automatic methods. We find that all three datasets contain
clear-cut errors that sometimes directly propagate into instruction-tuned LLMs.
We propose four AED baselines for the generative setting and evaluate them
comprehensively on the newly introduced dataset. Our results demonstrate that
choosing the right AED method and model size is indeed crucial, and we derive
practical recommendations from them. To gain further insight, we provide a first
case study examining how the quality of the instruction-tuning datasets
influences downstream performance.

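In this work, we present a new AED benchmark, Donkii, which comprises three instruction-tuning datasets annotated by experts and semi-automatic methods. We find that all three datasets contain clear-cut errors that sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them comprehensively on the newly introduced dataset. Our results show that choosing the right AED method and model size is indeed crucial, and we derive practical recommendations from them. To gain further insight, we provide a first case study examining how the quality of instruction-tuning datasets influences downstream performance.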