As research on machine translation moves to translating text beyond the
sentence level, it remains unclear how effective automatic evaluation metrics
are at scoring longer translations. In this work, we first propose a method for
creating paragraph-level data for training and meta-evaluating metrics from
existing sentence-level data. Then, we use these new datasets to benchmark
existing sentence-level metrics as well as train learned metrics at the
paragraph level. Interestingly, our experimental results demonstrate that using
sentence-level metrics to score entire paragraphs is as effective as
using a metric designed to work at the paragraph level. We speculate this
result can be attributed to properties of the task of reference-based
evaluation as well as limitations of our datasets with respect to capturing all
types of phenomena that occur in paragraph-level translations.

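In machine translation, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. This paper proposes a method for creating paragraph-level data from existing sentence-level data for training and meta-evaluating metrics, and uses these new datasets to benchmark existing sentence-level metrics and to train learned metrics at the paragraph level. Interestingly, the experimental results show that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. The authors speculate that this result can be attributed to properties of the reference-based evaluation task and to limitations of the datasets in capturing the range of phenomena that occur in paragraph-level translations.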

Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level
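
The abstract above does not spell out how paragraph-level data is derived from sentence-level data. As a rough sketch of one plausible reading, the code below groups consecutive sentences from the same document and system into fixed-size chunks, concatenates their text, and averages their scores. The record keys, the chunk size, and the choice of averaging are all assumptions made for illustration, not the authors' procedure.

```python
from collections import defaultdict
from statistics import mean


def build_paragraphs(segments, size=3):
    """Group consecutive sentence-level records into paragraph-level records.

    Each input record is a dict with hypothetical keys:
    "doc", "system", "index", "source", "translation", "reference", "score".
    """
    # Bucket sentences by (document, system) so a paragraph never mixes either.
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg["doc"], seg["system"])].append(seg)

    paragraphs = []
    for (doc, system), segs in groups.items():
        segs.sort(key=lambda s: s["index"])  # restore document order
        for start in range(0, len(segs), size):
            chunk = segs[start:start + size]
            paragraphs.append({
                "doc": doc,
                "system": system,
                "source": " ".join(s["source"] for s in chunk),
                "translation": " ".join(s["translation"] for s in chunk),
                "reference": " ".join(s["reference"] for s in chunk),
                # Assumption: the paragraph score is the mean sentence score.
                "score": mean(s["score"] for s in chunk),
            })
    return paragraphs
```

A sentence-level metric can then be applied to the concatenated "translation" and "reference" fields directly, which is the comparison the abstract reports on.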

Being able to rank the similarity of short text segments is an interesting
bonus feature of neural machine translation. Translation-based similarity
measures include direct and pivot translation probability, as well as
translation cross-likelihood, which has not been studied so far. We analyze
these measures in the common framework of multilingual NMT, releasing the
NMTScore library. Compared
to baselines such as sentence embeddings, translation-based measures prove
competitive in paraphrase identification and are more robust against
adversarial or multilingual input, especially if proper normalization is
applied. When used for reference-based evaluation of data-to-text generation in
two tasks and 17 languages, translation-based measures show a relatively high
correlation with human judgments.

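Working within the framework of multilingual neural machine translation, this study analyzes similarity measures based on direct and pivot translation probability and on translation cross-likelihood, and examines how well they assess the similarity of short texts. It presents and implements NMTScore, a library of translation-based similarity measures, and runs reference-based evaluation experiments on two data-to-text generation tasks and 17 languages. The results show that these measures are more robust than other methods and correlate relatively well with human judgments.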

NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures
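
To make two of the measures named in the abstract above more concrete, here is a minimal sketch. It does not use the NMTScore library's API. Instead it takes two hypothetical callables: `log_prob(target, source)`, standing in for a model's token-averaged log-probability of `target` given `source`, and `translate(text)`, standing in for a translation call. Normalizing by the self-translation score and averaging both directions are assumptions about what "proper normalization" might look like.

```python
import math


def direct_similarity(a, b, log_prob, normalize=True):
    """Direct translation probability: how likely is b as a 'translation' of a."""
    def one_way(src, tgt):
        score = log_prob(tgt, src)
        if normalize:
            # Assumption: divide by the probability of reproducing tgt itself.
            score -= log_prob(tgt, tgt)
        return math.exp(score)
    # Average both directions so the measure is symmetric.
    return 0.5 * (one_way(a, b) + one_way(b, a))


def cross_likelihood(a, b, translate, log_prob, normalize=True):
    """Cross-likelihood: how likely is a translation of a when b is the source."""
    def one_way(x, y):
        x_translated = translate(x)
        score = log_prob(x_translated, y)
        if normalize:
            score -= log_prob(x_translated, x)
        return math.exp(score)
    return 0.5 * (one_way(a, b) + one_way(b, a))


def toy_log_prob(target, source):
    """Toy stand-in for an NMT model: penalize target words absent from source."""
    target_tokens = target.split()
    source_tokens = set(source.split())
    missing = sum(1 for word in target_tokens if word not in source_tokens)
    return -0.1 - missing / len(target_tokens)


if __name__ == "__main__":
    identity = lambda text: text  # toy "translation" that returns its input
    print(direct_similarity("the cat sat", "the dog sat", toy_log_prob))
    print(cross_likelihood("the cat sat", "the dog sat", identity, toy_log_prob))
```

With a real multilingual NMT model in place of the toy functions, identical inputs would score 1.0 under this normalization and less similar inputs would score lower.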

A desirable property of a reference-based evaluation metric that measures the
content quality of a summary is that it should estimate how much information
that summary has in common with a reference. Traditional text-overlap-based
metrics such as ROUGE fail to achieve this because they are limited to matching
tokens, either lexically or via embeddings. In this work, we propose a metric
to evaluate the content quality of a summary using question-answering (QA).
QA-based methods directly measure a summary's information overlap with a
reference, making them fundamentally different from text overlap metrics. We
demonstrate the experimental benefits of QA-based metrics through an analysis
of our proposed metric, QAEval. QAEval outperforms current state-of-the-art
metrics on most evaluations using benchmark datasets, while being competitive
on others due to limitations of state-of-the-art models. Through a careful
analysis of each component of QAEval, we identify its performance bottlenecks
and estimate that its potential upper-bound performance surpasses all other
automatic metrics, approaching that of the gold-standard Pyramid Method.

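This work proposes QAEval, a question-answering-based metric for evaluating the content quality of a summary. Through an analysis of QAEval, it demonstrates that QA-based methods are more accurate than traditional text-overlap-based metrics such as ROUGE.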
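
The abstract above does not describe QAEval's pipeline in detail. A common shape for QA-based metrics, assumed here, is to take question-answer pairs derived from the reference, answer each question using only the candidate summary, and average an answer-overlap score. In the sketch below, `answer_question(question, context)` is a hypothetical callable standing in for a QA model, and token-level F1 is one possible answer-scoring choice rather than the one the paper necessarily uses.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_score(summary, qa_pairs, answer_question):
    """Average answer F1 over (question, gold_answer) pairs from the reference.

    The QA model only sees the candidate summary, so the score reflects how
    much of the reference's information the summary can reproduce.
    """
    if not qa_pairs:
        return 0.0
    scores = [
        token_f1(answer_question(question, summary), gold)
        for question, gold in qa_pairs
    ]
    return sum(scores) / len(scores)


def toy_answerer(question, context):
    """Toy stand-in for a QA model: a lookup that only answers if supported."""
    known = {"Who won?": "Alice", "Where was the race?": "Paris"}
    answer = known.get(question, "")
    return answer if answer in context else ""


if __name__ == "__main__":
    pairs = [("Who won?", "Alice"), ("Where was the race?", "Paris")]
    print(qa_score("Alice won the race", pairs, toy_answerer))
```

Because the score depends on answers rather than on shared tokens, a summary that paraphrases the reference can still score well as long as the QA model can recover the same answers from it.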