BriefGPT.xyz
Nov, 2024
Enhancing LLM Evaluations: The Garbling Trick
William F. Bradley
TL;DR
This paper addresses the saturation of traditional evaluation metrics for large language models (LLMs) by proposing a method that transforms an existing evaluation into a series of progressively harder tasks. The results reveal differences in reasoning ability across models, enabling an effective comparison of OpenAI's o1-preview and Google's gemini-pro-1.5-002.
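The core idea — turning one evaluation item into a ladder of increasingly difficult variants — can be sketched as below. This is a minimal illustration, not the paper's exact procedure: the garbling function (random character replacement at rate `p`) and the difficulty levels are assumptions for demonstration.

```python
import random

def garble(text: str, p: float, seed: int = 0) -> str:
    """Corrupt a prompt by replacing each alphabetic character
    with a random lowercase letter with probability p.
    (Illustrative garbling; the paper's transform may differ.)"""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < p:
            out.append(rng.choice(letters))
        else:
            out.append(ch)
    return "".join(out)

# Build a difficulty ladder from a single evaluation question:
# higher garbling rates make the same task progressively harder,
# so model accuracy degrades at different rates instead of saturating.
question = "What is the capital of France?"
ladder = [garble(question, p) for p in (0.0, 0.1, 0.25, 0.5)]
```

Scoring a model at each garbling level yields a degradation curve rather than a single saturated score, which is what allows models with similar headline accuracy to be separated.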
Abstract
As Large Language Models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models based on their performance. We propose a general …