From grading papers to summarizing medical documents, large language models
(LLMs) are increasingly used to evaluate text generated by humans and AI
alike. However, despite their extensive utility, LLMs exhibit distinct failure
modes, necessitating a thorough audit and improvement of their text evaluation
capabilities. Here we introduce ALLURE, a systematic approach to Auditing Large
Language Models Understanding and Reasoning Errors. ALLURE involves comparing
LLM-generated evaluations with annotated data, and iteratively incorporating
instances of significant deviation into the evaluator, which leverages
in-context learning (ICL) to make LLM evaluation of text more robust. Through
this iterative process, we aim to refine the performance of the
evaluator LLM, ultimately reducing the reliance on human annotators in the
evaluation process. We anticipate that ALLURE will serve diverse applications
of LLMs in domains that involve evaluating textual data, and will improve
productivity in these fields.
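
The audit loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `audit_loop` and `toy_evaluator`, the absolute-difference deviation measure, the `threshold` value, and the stopping rule are all assumptions made for illustration, since the abstract does not specify them. The toy evaluator stands in for an LLM call that receives the collected failure cases as in-context examples.

```python
from typing import Callable, List, Tuple

# (text, human-annotated score); the numeric scoring scheme is an assumption.
Example = Tuple[str, float]
Evaluator = Callable[[str, List[Example]], float]


def audit_loop(
    evaluate: Evaluator,
    annotated: List[Example],
    threshold: float = 1.0,   # hypothetical cutoff for "significant deviation"
    max_rounds: int = 3,      # hypothetical stopping rule
) -> List[Example]:
    """Collect cases where the evaluator deviates from the annotations and
    feed them back to it as in-context examples on the next round."""
    icl_examples: List[Example] = []
    for _ in range(max_rounds):
        failures = [
            (text, gold)
            for text, gold in annotated
            if (text, gold) not in icl_examples
            and abs(evaluate(text, icl_examples) - gold) > threshold
        ]
        if not failures:
            break  # no remaining significant deviations
        icl_examples.extend(failures)
    return icl_examples


def toy_evaluator(text: str, icl_examples: List[Example]) -> float:
    """Stand-in for an LLM evaluator (assumption): reuses the score of an
    identical in-context example, otherwise returns a fixed guess of 3.0."""
    for ex_text, ex_score in icl_examples:
        if ex_text == text:
            return ex_score
    return 3.0


annotated = [
    ("clear, correct summary", 5.0),
    ("summary omits the diagnosis", 1.0),
    ("adequate summary", 3.5),
]
memory = audit_loop(toy_evaluator, annotated, threshold=1.0)
```

In this toy run, the two examples whose scores differ from the annotation by more than the threshold end up in `memory`, while the third stays out because its deviation is small. In a real setting, `evaluate` would format `icl_examples` into the prompt of the evaluator LLM.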

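ALLURE is a systematic approach to auditing the understanding and reasoning errors of large language models. It compares LLM-generated evaluations with annotated data and iteratively incorporates instances of significant deviation into the evaluator, using in-context learning (ICL) to make LLM evaluation of text more robust and ultimately reduce the reliance on human annotators in the evaluation process. ALLURE is expected to serve a range of LLM applications in domains related to the evaluation of textual data and to efficiency in those domains.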