We address a fundamental challenge in the evaluation of Natural Language Generation (NLG) models: the design and validation of evaluation metrics. Recognizing the limitations of existing metrics and the problems with human judgment, we propose measurement theory, the foundation of test design, as a framework for conceptualizing and evaluating the validity and reliability of NLG evaluation metrics. This framework offers a systematic way to define what makes a metric "good," to develop robust metrics, and to assess metric performance. In this paper, we introduce core concepts of measurement theory in the context of NLG evaluation, along with key methods for evaluating the performance of NLG metrics. Through this framework, we aim to promote the design, evaluation, and interpretation of valid and reliable metrics, and ultimately to contribute to the advancement of robust and effective NLG models in real-world settings.

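This paper proposes an approach grounded in test design for conceptualizing and evaluating the reliability and validity of natural language generation evaluation metrics, and introduces the core concepts of measurement theory together with key methods for assessing the performance of natural language generation metrics. Through this framework, the work aims to promote the design, evaluation, and interpretation of reliable and valid metrics, and ultimately to contribute to the improvement of robust and effective natural language generation models in real-world applications.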

Evaluating Natural Language Generation Evaluation Metrics: A Measurement Theory Perspective