Recent machine translation (MT) metrics calibrate their effectiveness by
correlating with human judgement, but this gives no insight into their behaviour
across different error types. Challenge sets are used to probe specific
dimensions of metric behaviour, but there are very few such datasets and they
focus on either a limited number of phenomena or a limited number of language
pairs. We introduce ACES, a contrastive challenge set spanning 146 language
pairs, aimed at discovering whether metrics can identify 68 translation
accuracy errors. These phenomena range from simple alterations at the
word/character level to more complex errors based on discourse and real-world
knowledge. We conduct a large-scale study by benchmarking, on ACES, 50 metrics
submitted to the WMT 2022 and 2023 metrics shared tasks. We benchmark metric
performance, assess incremental performance over successive campaigns,
and measure sensitivity to a range of linguistic phenomena. We also
investigate claims that Large Language Models (LLMs) are effective as MT
evaluators by evaluating on ACES. Our results demonstrate that different metric
families struggle with different phenomena and that LLM-based methods fail to
demonstrate reliable performance. Our analyses indicate that most metrics
ignore the source sentence, tend to prefer surface-level overlap and end up
incorporating properties of base models that are not always beneficial. We
expand ACES to include error span annotations, denoted SPAN-ACES, and we use
this dataset to evaluate span-based error metrics, showing that these metrics also
need considerable improvement. Finally, we provide a set of recommendations for
building better MT metrics, including focusing on error labels instead of
scores, ensembling, designing strategies to explicitly focus on the source
sentence, focusing on semantic content and choosing the right base model for
representations.
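
As an illustration of how a contrastive challenge set can be scored, the
following is a minimal sketch, not the authors' implementation. It assumes each
example pairs a good translation with an incorrect one, and that a metric is
rewarded when it scores the good translation strictly higher. The Kendall
tau-like statistic, the `ContrastiveExample` fields, the toy `token_overlap`
metric and the two example sentences are all assumptions made for illustration;
none of them is taken from the abstract above.

```python
from typing import Callable, Iterable, NamedTuple


class ContrastiveExample(NamedTuple):
    """One hypothetical challenge-set item: two candidate translations."""
    source: str
    reference: str
    good: str       # translation without the accuracy error
    incorrect: str  # translation containing the accuracy error


# A metric maps (source, reference, hypothesis) to a score; higher is better.
Metric = Callable[[str, str, str], float]


def kendall_tau_like(metric: Metric, examples: Iterable[ContrastiveExample]) -> float:
    """Return (concordant - discordant) / (concordant + discordant).

    Assumption: a tie counts as discordant, because the metric failed to
    prefer the good translation.
    """
    concordant = discordant = 0
    for ex in examples:
        good_score = metric(ex.source, ex.reference, ex.good)
        bad_score = metric(ex.source, ex.reference, ex.incorrect)
        if good_score > bad_score:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        raise ValueError("no examples to score")
    return (concordant - discordant) / total


def token_overlap(source: str, reference: str, hypothesis: str) -> float:
    """Toy surface-overlap metric: share of hypothesis tokens found in the reference.

    It ignores the source sentence entirely.
    """
    ref_tokens = set(reference.lower().split())
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in hyp_tokens) / len(hyp_tokens)


# Invented examples, for illustration only.
EXAMPLES = [
    ContrastiveExample(
        source="Die Sitzung beginnt um neun Uhr morgens",
        reference="The meeting starts at nine in the morning",
        good="The meeting begins at 9 am",
        incorrect="The meeting starts at five in the morning",
    ),
    ContrastiveExample(
        source="Sie kaufte drei Äpfel",
        reference="She bought three apples",
        good="She bought three apples",
        incorrect="He sold two pears",
    ),
]

if __name__ == "__main__":
    print(kendall_tau_like(token_overlap, EXAMPLES))
```

In the first invented example the overlap metric prefers the incorrect
translation, because the wrong number changes only one token while the good
paraphrase changes several. This is the kind of surface-level preference the
analysis describes.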

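This work introduces ACES, a contrastive challenge set spanning 146 language pairs, to discover whether metrics can identify 68 translation accuracy errors. It benchmarks 50 metrics from the WMT 2022 and 2023 metrics shared tasks, assessing their incremental performance and their sensitivity to a range of linguistic phenomena. The results show that different metric families struggle with different phenomena and that LLM-based methods do not perform reliably. ACES is extended with error span annotations, denoted SPAN-ACES, and this dataset is used to evaluate span-based error metrics, showing that these metrics also need considerable improvement. Finally, the work offers recommendations for building better MT metrics: focusing on error labels instead of scores, ensembling multiple metrics, designing strategies that explicitly focus on the source sentence, focusing on semantic content, and choosing the right base model for representations.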

Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets

Neural network models have shown great success at natural language inference
(NLI), the task of determining whether a premise entails a hypothesis. However,
recent studies suggest that these models may rely on fallible heuristics rather
than deep language understanding. We introduce a challenge set to test whether
NLI systems adopt one such heuristic: assuming that a sentence entails all of
its subsequences, such as assuming that "Alice believes Mary is lying" entails
"Alice believes Mary." We evaluate several competitive NLI models on this
challenge set and find strong evidence that they do rely on the subsequence
heuristic.
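
To make the subsequence heuristic concrete, here is a minimal sketch of a
"model" that does nothing except apply it. It assumes a subsequence means a
contiguous run of tokens and uses a simple lowercase word tokenizer. The
function names and label strings are hypothetical and are not taken from the
challenge set itself.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep word-like tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_contiguous_subsequence(premise: str, hypothesis: str) -> bool:
    """True if the hypothesis tokens appear as one contiguous run in the premise."""
    p, h = tokenize(premise), tokenize(hypothesis)
    if not h or len(h) > len(p):
        return False
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))


def heuristic_predict(premise: str, hypothesis: str) -> str:
    """Predict entailment purely from the (fallible) subsequence heuristic."""
    if is_contiguous_subsequence(premise, hypothesis):
        return "entailment"
    return "non-entailment"


if __name__ == "__main__":
    premise = "Alice believes Mary is lying."
    hypothesis = "Alice believes Mary."
    # The heuristic fires here even though the premise does not entail the hypothesis.
    print(heuristic_predict(premise, hypothesis))
```

A challenge set of this kind collects premise and hypothesis pairs on which
such a heuristic fires but the correct label is not entailment, so a system
that relies on the heuristic is exposed by its errors.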

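This paper introduces a challenge set to test whether NLI systems adopt one particular heuristic: assuming that a sentence entails all of its subsequences, for example that "Alice believes Mary is lying" entails "Alice believes Mary." The authors evaluate several competitive NLI models and find strong evidence that they do rely on the subsequence heuristic.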