Large language models (LLMs) have demonstrated remarkable capabilities out of
the box for a wide range of applications, yet accuracy remains a major growth
area, especially in mission-critical domains such as biomedicine. An effective
method to calibrate the confidence level on LLM responses is essential to
automatically detect errors and facilitate human-in-the-loop verification. An
important source of calibration signals stems from expert-stipulated
programmatic supervision, which is often available at low cost but has its own
limitations such as noise and coverage. In this paper, we introduce a Pareto
optimal self-supervision framework that can leverage available programmatic
supervision to systematically calibrate LLM responses by producing a risk score
for every response, without any additional manual effort. This is accomplished
by learning a harmonizer model to align LLM output with other available
supervision sources, which would assign higher risk scores to more uncertain
LLM responses and facilitate error correction. Experiments on standard relation
extraction tasks in biomedical and general domains demonstrate the promise of
this approach, with our proposed risk scores highly correlated with the real
error rate of LLMs. For the most uncertain test instances, dynamic prompting
based on our proposed risk scores results in significant accuracy improvement
for off-the-shelf LLMs, boosting GPT-3 results past state-of-the-art (SOTA)
weak supervision and GPT-4 results past SOTA supervised results on challenging
evaluation datasets.
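
As a rough illustration of how risk scores could drive dynamic prompting, the sketch below picks the highest-risk responses for a second prompting pass. It assumes the risk scores have already been produced by some calibration model; the function name `select_for_reprompting` and the `fraction` parameter are hypothetical and not taken from the paper.

```python
import math


def select_for_reprompting(risk_scores, fraction=0.1):
    """Return indices of the highest-risk responses, most uncertain first.

    risk_scores: one float per LLM response (higher means more likely wrong).
    fraction: share of responses to send back for re-prompting (assumed knob).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always select at least one response when any are available.
    k = max(1, math.ceil(fraction * len(risk_scores)))
    ranked = sorted(range(len(risk_scores)),
                    key=lambda i: risk_scores[i],
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.4, 0.7]
    print(select_for_reprompting(scores, fraction=0.5))
```

Only the selected indices would be re-prompted; the remaining responses are kept as they are.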

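This paper proposes a Pareto optimal self-supervision framework that leverages available programmatic supervision to systematically calibrate LLM responses, producing a risk score for every response without any additional manual effort.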

Automatic Calibration and Error Correction for Large Language Models via Pareto Optimal Self-Supervision

Risk scores are simple classification models that let users make quick risk
predictions by adding and subtracting a few small numbers. These models are
widely used in medicine and criminal justice, but are difficult to learn from
data because they need to be calibrated, sparse, use small integer
coefficients, and obey application-specific operational constraints. In this
paper, we present a new machine learning approach to learn risk scores. We
formulate the risk score problem as a mixed integer nonlinear program, and
present a cutting plane algorithm for non-convex settings to efficiently
recover its optimal solution. We improve our algorithm with specialized
techniques to generate feasible solutions, narrow the optimality gap, and
reduce data-related computation. Our approach can fit risk scores in a way that
scales linearly in the number of samples, provides a certificate of optimality,
and obeys real-world constraints without parameter tuning or post-processing.
We benchmark the performance benefits of this approach through an extensive set
of numerical experiments, comparing to risk scores built using heuristic
approaches. We also discuss its practical benefits through a real-world
application where we build a customized risk score for ICU seizure prediction
in collaboration with the Massachusetts General Hospital.
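
To make the idea of "adding and subtracting a few small numbers" concrete, here is a minimal sketch of how a risk score might be applied once it has been learned. The feature names, point values, and intercept are invented for illustration, and mapping the total score to a probability with a logistic function is an assumption rather than something stated in the abstract.

```python
import math

# Hypothetical scorecard: small integer points per binary feature.
POINTS = {"feature_a": 2, "feature_b": 1, "feature_c": -1}
INTERCEPT = -3  # hypothetical integer offset


def total_score(features):
    """Add the points of every feature that is present, plus the intercept."""
    return INTERCEPT + sum(points for name, points in POINTS.items()
                           if features.get(name))


def predicted_risk(score):
    """Map an integer score to a probability (assumed logistic link)."""
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    patient = {"feature_a": True, "feature_b": True, "feature_c": False}
    score = total_score(patient)
    print(score, predicted_risk(score))
```

Because the model is sparse and the coefficients are small integers, the score can be tallied by hand, which is the property the abstract emphasizes.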

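This paper learns risk scores that are calibrated, sparse, use integer coefficients, and obey application-specific constraints by solving a mixed integer nonlinear program with a cutting plane algorithm, improving on risk scores built with traditional heuristic methods and better fitting real-world applications.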