Evaluating free-form responses generated by Large Language Models (LLMs) remains a challenge because such responses are diverse and open-ended. Traditional automatic metrics that rely on supervised signals fail to capture semantic equivalence or to handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Using LLMs as evaluators offers a promising alternative because of their strong language understanding and instruction-following capabilities. Building on these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM judges and engages a third, arbitrating judge only when the two disagree. This selective arbitration prioritizes evaluation reliability while avoiding the unnecessary computation of conventional majority voting, which queries every judge on every sample. DAFE combines task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
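The selective arbitration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Judge` signature, the binary verdict, and the three toy judges (`strict_judge`, `lenient_judge`, `token_arbitrator`) are hypothetical stand-ins for LLM calls, chosen so the example runs without any model access.

```python
import string
from typing import Callable, Tuple

# Assumption: a judge maps (question, reference answer, candidate answer)
# to a binary verdict. In practice each judge would wrap an LLM prompt.
Judge = Callable[[str, str, str], bool]


def arbitrate(question: str, reference: str, candidate: str,
              judge_a: Judge, judge_b: Judge,
              arbitrator: Judge) -> Tuple[bool, int]:
    """Return (verdict, number of judge calls made)."""
    a = judge_a(question, reference, candidate)
    b = judge_b(question, reference, candidate)
    if a == b:
        # The two primary judges agree, so the arbitrator is never called.
        return a, 2
    # Disagreement: the third judge breaks the tie.
    return arbitrator(question, reference, candidate), 3


# Hypothetical toy judges, used only to make the sketch executable.
def strict_judge(question: str, reference: str, candidate: str) -> bool:
    # Accepts only a case-insensitive exact match.
    return candidate.strip().lower() == reference.strip().lower()


def lenient_judge(question: str, reference: str, candidate: str) -> bool:
    # Accepts any candidate that contains the reference as a substring.
    return reference.strip().lower() in candidate.lower()


def token_arbitrator(question: str, reference: str, candidate: str) -> bool:
    # Accepts when every reference token appears as a whole candidate token.
    strip = str.maketrans("", "", string.punctuation)
    ref_tokens = set(reference.lower().translate(strip).split())
    cand_tokens = set(candidate.lower().translate(strip).split())
    return bool(ref_tokens) and ref_tokens <= cand_tokens


if __name__ == "__main__":
    q = "What is the capital of France?"
    for answer in ["paris", "It is Paris.", "Parisian suburbs", "Lyon"]:
        print(answer, arbitrate(q, "Paris", answer,
                                strict_judge, lenient_judge, token_arbitrator))
```

Under the binary-verdict assumption with deterministic judges, this procedure returns the same label as a three-way majority vote: when the primary judges agree they already form the majority, and when they disagree the third vote decides. The saving comes from skipping the third call on every sample where the first two agree.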

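This work addresses a key problem in evaluating free-form responses generated by language models: traditional evaluation methods fail to capture semantic equivalence effectively. We propose the Dynamic Arbitration Framework for Evaluation (DAFE), which uses multiple large language models as evaluators to improve evaluation accuracy and yields significant gains in evaluation metrics. The framework performs well in consistency, scalability, and resource efficiency, showing its potential impact on the evaluation of free-form outputs.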

DAFE: LLM-Based Dynamic Arbitration for Evaluating Free-Form Question Answering