Classification systems are evaluated in countless papers.
However, we find that evaluation practice is often nebulous. Frequently,
metrics are selected without arguments, and blurry terminology invites
misconceptions. For instance, many works use so-called 'macro' metrics to rank
systems (e.g., 'macro F1') but do not clearly specify what they would expect
from such a 'macro' metric. This is problematic, since the choice of metric can
affect paper findings as well as shared task rankings, so clarity in the
process should be maximized.
Starting from the intuitive concepts of bias and prevalence, we perform an
analysis of common evaluation metrics, considering the expectations expressed
in papers. Equipped with a thorough understanding of the metrics, we
survey metric selection in recent shared tasks of Natural Language Processing.
The results show that metric choices are often not supported with convincing
arguments, an issue that can make any ranking seem arbitrary. This work aims to
provide an overview and guidance for more informed and transparent metric
selection, fostering meaningful evaluation.

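Classification systems are evaluated in countless papers. However, we find that evaluation practice is often nebulous: metrics are frequently selected without justification, and blurry terminology invites misconceptions. Starting from the intuitive concepts of bias and prevalence, this paper analyzes common evaluation metrics, taking into account the expectations expressed in papers. Equipped with a thorough understanding of the metrics, we survey metric selection in recent Natural Language Processing shared tasks. The results show that metric choices often lack convincing arguments, which can make any ranking seem arbitrary. This work aims to provide an overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.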
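
To make the concern about 'macro' metrics concrete, the following minimal sketch contrasts one common reading of macro F1 (the unweighted mean of per-class F1 scores) with micro F1 on a toy single-label problem. The function names and the toy labels are illustrative assumptions, not taken from the paper, and other definitions of 'macro F1' exist.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was the true label but missed
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def micro_f1(gold, pred):
    """In single-label classification, micro F1 equals accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical toy data: an imbalanced task and a majority-class predictor.
gold = ["a"] * 8 + ["b"] * 2
pred = ["a"] * 10

print(f"macro F1 = {macro_f1(gold, pred):.3f}")  # 0.444
print(f"micro F1 = {micro_f1(gold, pred):.3f}")  # 0.800
```

On this toy data the majority-class predictor looks strong under micro F1 and weak under macro F1, which illustrates why the choice of metric, and the expectation behind it, can change a ranking.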

A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

Knowledge graph completion (KGC) aims to infer missing knowledge triples
based on known facts in a knowledge graph. Current KGC research mostly follows
an entity ranking protocol, wherein the effectiveness is measured by the
predicted rank of a masked entity in a test triple. The overall performance is
then given by a micro(-average) metric over all individual answer entities. Due
to the incomplete nature of large-scale knowledge bases, such an entity
ranking setting is likely affected by unlabelled top-ranked positive examples,
raising the question of whether the current evaluation protocol is sufficient to
guarantee a fair comparison of KGC systems. To this end, this paper presents a
systematic study on whether and how label sparsity affects the current KGC
evaluation with the popular micro metrics. Specifically, inspired by the TREC
paradigm for large-scale information retrieval (IR) experimentation, we create
a relatively "complete" judgment set based on a sample from the popular
FB15k-237 dataset following the TREC pooling method. Surprisingly, our analysis
shows that switching from the original labels to our "complete" labels
drastically changes the system ranking of 13 popular KGC models in terms of
micro metrics. Further investigation
indicates that the IR-like macro(-average) metrics are more stable and
discriminative under different settings, while also being less affected by label
sparsity. Thus, for KGC evaluation, we recommend conducting TREC-style pooling
to balance human effort against label completeness, and also reporting the
IR-like macro metrics to reflect the ranking nature of the KGC task.

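This paper takes a close look at the soundness of knowledge graph completion (KGC) evaluation. It finds that the existing micro metrics are problematic when large-scale knowledge bases are sparsely labelled, whereas macro metrics are more robust. It proposes TREC-style pooling as a way to balance label completeness against human effort, and recommends reporting macro metrics to reflect the ranking nature of the KGC task.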
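
The difference between the two averaging schemes can be sketched with mean reciprocal rank (MRR). This is a minimal illustration under one plausible reading of the abstract: the micro variant averages over all individual answer entities, while the IR-like macro variant first averages within each query and then across queries. The function names, the query keys, and the ranks are hypothetical and are not the paper's data or implementation.

```python
def micro_mrr(ranks_by_query):
    """Average reciprocal rank over every individual answer entity."""
    all_ranks = [r for ranks in ranks_by_query.values() for r in ranks]
    return sum(1.0 / r for r in all_ranks) / len(all_ranks)


def macro_mrr(ranks_by_query):
    """Average reciprocal rank per query first, then across queries."""
    per_query = [
        sum(1.0 / r for r in ranks) / len(ranks)
        for ranks in ranks_by_query.values()
    ]
    return sum(per_query) / len(per_query)


# Hypothetical predicted ranks, keyed by a (head, relation) query.
# The first query has four labelled answers, the second has only one.
ranks_by_query = {
    ("entity_1", "relation_a"): [1, 1, 1, 1],
    ("entity_2", "relation_b"): [4],
}

print(f"micro MRR = {micro_mrr(ranks_by_query):.3f}")  # 0.850
print(f"macro MRR = {macro_mrr(ranks_by_query):.3f}")  # 0.625
```

In the micro variant, a query with many labelled answers dominates the score, so the result depends on how many answers happen to be labelled per query. The macro variant gives each query equal weight.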