Human evaluation is often required to judge abstractive summaries fairly.
However, it is often time-consuming, costly, inconsistent, and
non-reproducible. To overcome these challenges, we explore
the potential of using an out-of-the-box LLM (i.e., "gpt-3.5-turbo") for
summarization evaluation without manual demonstration selection or complex
prompt tuning. We compare different evaluation methods, including two methods for
Likert-scale scoring and one method for head-to-head comparisons, to investigate
the performance of the LLM as a zero-shot evaluator. We further propose a
meta-correlation metric to measure the stability of the LLM's evaluation
capability. With extensive experiments, we show that certain prompt formats can
produce better results than others. We also draw attention to the deterioration
of the LLM's evaluation capability as summary quality rises. In
addition, we find that the LLM's evaluation capability depends on the
evaluated dimensions. We discuss the pros and cons of each method, make
recommendations, and suggest some future directions for improvement.
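
The abstract does not define the meta-correlation metric, so the following is only a minimal sketch of one plausible reading. It assumes the metric correlates each summarization system's quality (mean human score) with how well the LLM's scores agree with human scores on that system's summaries. The function names, the data layout, and the toy scores are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def meta_correlation(scores_by_system):
    """Correlate per-system quality with per-system LLM-human agreement.

    scores_by_system maps a system name to (llm_scores, human_scores),
    two parallel lists of Likert-style scores for the same summaries.
    ASSUMPTION: this definition is an illustrative guess, not the paper's.
    """
    quality = []    # mean human score of each system
    agreement = []  # LLM-human correlation within each system
    for llm_scores, human_scores in scores_by_system.values():
        quality.append(sum(human_scores) / len(human_scores))
        agreement.append(pearson(llm_scores, human_scores))
    return pearson(quality, agreement)


# Made-up toy scores: agreement with humans drops as system quality rises.
toy_scores = {
    "weak_system": ([1, 2, 3, 4], [1, 2, 3, 4]),
    "mid_system": ([3, 2, 4, 5], [2, 3, 4, 5]),
    "strong_system": ([4, 5, 4, 5], [4, 4, 5, 5]),
}

if __name__ == "__main__":
    print(f"meta-correlation: {meta_correlation(toy_scores):.3f}")
```

Under this reading, a strongly negative value on real data would correspond to the reported deterioration: the evaluator tracks human judgments less closely for higher-quality systems. A value near zero would suggest an evaluation capability that is stable across quality levels.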

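This paper explores using LLMs (e.g., "gpt-3.5-turbo") as automatic evaluators of summary quality, and compares how different evaluation methods and prompt formats affect their evaluation capability. The authors recommend prompt formats that can improve the LLM's performance, and discuss how the LLM's evaluation capability varies with summary quality and with the evaluated dimension.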