Critique ability is crucial to the scalable oversight and self-improvement
of Large Language Models (LLMs). While many recent studies explore the ability
of LLMs to judge and refine flawed generations, how to measure this critique
ability comprehensively and reliably remains under-explored.
This paper introduces \shortname, a novel benchmark designed to
comprehensively and reliably evaluate four key critique ability dimensions of
LLMs: feedback, comparison, refinement and meta-feedback.
\shortname~encompasses nine diverse tasks, each assessing the LLMs' ability to
critique responses at varying levels of quality granularity. Our extensive
evaluations of open-source and closed-source LLMs reveal intriguing
relationships between the critique ability and tasks, response qualities, and
model scales. The datasets, resources, and evaluation toolkit for \shortname~will be
publicly released at https://github.com/gmftbyGMFTBY/CriticBench.

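This paper introduces a new benchmark for comprehensively and reliably evaluating the critique ability of Large Language Models (LLMs). The benchmark comprises nine diverse tasks that assess how well language models critique responses at varying levels of quality granularity, and it reveals intriguing relationships between critique ability and tasks, response quality, and model scale.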