Large language models (LLMs) have demonstrated a significant ability to generalize across a large number of NLP tasks. For industry applications, it is imperative to periodically assess LLM performance on unlabeled production data in order to validate the model in a real-world setting. Human labeling to assess model error is expensive and slow. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, based on our evaluation on the keyphrase extraction (KPE) task. We measure the fidelity of the results by comparing them to the true error measured from human-labeled ground truth. We contrast this with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show that disagreement scores provide a better estimate of model performance, with mean absolute error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.

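In this paper, through an evaluation on the keyphrase extraction task, we show that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings. By comparing against the true error, we find that disagreement scores estimate model performance better than using another language model as a source of machine labels, or silver labels, with a mean error as low as 0.4% and on average 13.8% better than using silver labels.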
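To make the idea concrete, the following is a minimal Python sketch of how a disagreement score and its fidelity could be computed. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not define the score, so disagreement is assumed here to be one minus the mean pairwise F1 overlap between the keyphrase sets predicted by the ensemble members, and the estimated error is assumed to be the plain average of that score over a dataset. All function names are hypothetical.

```python
from itertools import combinations


def f1(pred, ref):
    """F1 overlap between two collections of keyphrases, compared as sets."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0  # two empty predictions agree perfectly
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)


def disagreement_score(member_predictions):
    """Assumed definition: 1 - mean pairwise F1 among ensemble members
    for a single document."""
    pairs = list(combinations(member_predictions, 2))
    if not pairs:
        raise ValueError("need at least two ensemble members")
    return 1.0 - sum(f1(a, b) for a, b in pairs) / len(pairs)


def estimated_error(ensemble_outputs):
    """Assumed estimator: average disagreement over all documents.
    ensemble_outputs holds one list of member predictions per document."""
    scores = [disagreement_score(doc) for doc in ensemble_outputs]
    return sum(scores) / len(scores)


def mean_absolute_error(estimates, truths):
    """MAE between estimated and human-measured error, one pair per dataset."""
    if len(estimates) != len(truths):
        raise ValueError("estimates and truths must have the same length")
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(estimates)
```

In use, `estimated_error` would be computed on unlabeled production data for each language or domain, and `mean_absolute_error` would compare those estimates with the true error obtained from human labels on the datasets where such labels exist. The same comparison could be repeated with errors measured against silver labels to contrast the two proxies.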

Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP