The conventional paradigm of using large language models (LLMs) for evaluating natural language generation (NLG) systems typically relies on two key inputs: (1) a clear definition of the NLG task to be evaluated and (2) a list of pre-defined evaluation criteria. This process treats LLMs as ''passive critics,'' strictly following human-defined criteria for evaluation. However, as new NLG tasks emerge, the criteria for assessing text quality can vary greatly. Consequently, these rigid evaluation methods struggle to adapt to diverse NLG tasks without extensive prompt engineering customized for each specific task. To address this limitation, we introduce Active-Critic, a novel LLM-based NLG evaluation protocol that enables LLMs to function as ''active critics.'' Specifically, our protocol comprises two key stages. In the first stage, the LLM is instructed to infer the target NLG task and establish relevant evaluation criteria from the data. Building on this self-inferred information, the second stage dynamically optimizes the prompt to guide the LLM toward more human-aligned scoring decisions, while also generating detailed explanations to justify its evaluations. Experiments across four NLG evaluation tasks show that our approach achieves stronger alignment with human judgments than state-of-the-art evaluation methods. Our comprehensive analysis further highlights the effectiveness and explainability of Active-Critic with only a small amount of labeled data. We will share our code and data on GitHub.

本研究解决了当前自然语言生成评估中，使用大型语言模型作为“被动批评者”的局限性，提出了一种新颖的“积极批评者”评估协议。该协议允许大型语言模型自我推断任务并动态优化评估标准，实现了与人类评估标准的更强一致性，并在多个评估任务中展现出其有效性和可解释性。

大型语言模型在自然语言生成评估中的积极批评者