In this paper, we revisit automatic metrics for paraphrase evaluation and
report two findings that contradict conventional wisdom: (1) reference-free
metrics perform better than their reference-based counterparts, and (2) most
commonly used metrics do not align well with human annotation. We explore the
reasons behind these findings through additional experiments and in-depth
analyses. Based on these experiments and analyses, we propose
ParaScore, a new evaluation metric for paraphrase generation. It combines the
merits of both reference-based and reference-free metrics and explicitly models
lexical divergence. Experimental results demonstrate that ParaScore
significantly outperforms existing metrics.
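
To make the idea of combining both metric families with an explicit divergence
term concrete, here is a minimal sketch. Everything in it is an illustrative
assumption rather than the ParaScore definition, which this abstract does not
state: the function names, the token-overlap similarity (a toy stand-in for a
learned semantic similarity), the way the two similarities are combined, and
the `weight` parameter are all hypothetical.

```python
from typing import Optional


def token_jaccard(a: str, b: str) -> float:
    """Toy stand-in for a learned semantic similarity, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def normalized_edit_distance(a: str, b: str) -> float:
    """Word-level Levenshtein distance divided by the longer length."""
    x, y = a.lower().split(), b.lower().split()
    if not x and not y:
        return 0.0
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(x), len(y))


def hybrid_paraphrase_score(
    source: str,
    candidate: str,
    reference: Optional[str] = None,
    weight: float = 0.5,
) -> float:
    """Hypothetical hybrid score: similarity plus a lexical-divergence bonus.

    Assumption: the reference-free signal (source vs. candidate) and the
    reference-based signal (reference vs. candidate) are merged by taking
    the larger one; the divergence term rewards candidates whose wording
    differs from the source.
    """
    similarity = token_jaccard(source, candidate)
    if reference is not None:
        similarity = max(similarity, token_jaccard(reference, candidate))
    divergence = normalized_edit_distance(source, candidate)
    return similarity + weight * divergence
```

Under these assumptions the score still works when no reference is available,
and a candidate that merely copies the source earns no divergence bonus.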
