Two main approaches for evaluating the quality of machine-generated rationales are: 1) using human rationales as a gold standard; and 2) automated metrics based on how rationales affect model behavior. An open question, however, is how human rationales fare with these automatic metrics. Analyzing a variety of datasets and models, we find that human rationales do not necessarily perform well on these metrics. To unpack this finding, we propose improved metrics to account for model-dependent baseline performance. We then propose two methods to further characterize rationale quality, one based on model retraining and one on using "fidelity curves" to reveal properties such as irrelevance and redundancy. Our work leads to actionable suggestions for evaluating and characterizing rationales.

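There are two main approaches to measuring the quality of machine-generated rationales: using human rationales as a gold standard, and automated metrics based on how rationales affect model behavior. How well human rationales perform on these automated metrics, however, remains an open question. We propose improved metrics to address this, along with two methods to further characterize rationale quality: one based on model retraining, and one using "fidelity curves" to reveal properties such as irrelevance and redundancy. Our work yields practical suggestions for evaluating and characterizing rationales.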

Evaluating and Characterizing Human Rationales