Classification systems are evaluated in countless papers.
However, we find that evaluation practice is often nebulous. Frequently,
metrics are selected without arguments, and blurry terminology invites
misconceptions. For instance, many works use so-called 'macro' metrics to rank
systems (e.g., 'macro F1') but do not clearly specify what they would expect
from such a 'macro' metric. This is problematic, since picking a metric can
affect paper findings as well as shared task rankings, and thus clarity in
the process should be maximized.
Starting from the intuitive concepts of bias and prevalence, we perform an
analysis of common evaluation metrics, considering the expectations
expressed in papers. Equipped with a thorough understanding of the metrics, we
survey metric selection in recent shared tasks of Natural Language Processing.
The results show that metric choices are often not supported with convincing
arguments, an issue that can make any ranking seem arbitrary. This work aims to
provide an overview and guidance for more informed and transparent metric
selection, fostering meaningful evaluation.
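
To illustrate why an unspecified 'macro F1' can be ambiguous, the following minimal sketch contrasts two formulas that are both commonly called macro F1: the arithmetic mean of per-class F1 scores, and the harmonic mean of macro-averaged precision and macro-averaged recall. The function names and the toy labels are assumptions made for this illustration and are not taken from the paper.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return scores


def macro_f1_mean_of_f1(scores):
    """Variant 1: arithmetic mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def macro_f1_of_macro_pr(scores):
    """Variant 2: harmonic mean of macro precision and macro recall."""
    p = sum(prec for prec, _, _ in scores.values()) / len(scores)
    r = sum(rec for _, rec, _ in scores.values()) / len(scores)
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy data (hypothetical): 8 items of class "a", 2 items of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

scores = per_class_prf(y_true, y_pred)
variant_1 = macro_f1_mean_of_f1(scores)
variant_2 = macro_f1_of_macro_pr(scores)
print(f"mean of per-class F1:       {variant_1:.4f}")
print(f"F1 of macro precision/recall: {variant_2:.4f}")
```

On this toy input the two variants give different values for the same predictions, so a ranking may depend on which formula a paper or shared task means by 'macro F1'.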

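Classification systems are evaluated in countless papers. However, we find that evaluation practice is often vague. Metrics are frequently selected without justification, and blurry terminology invites misconceptions. Starting from the intuitive concepts of bias and prevalence, this paper analyzes common evaluation metrics, taking into account the expectations expressed in papers. With a thorough understanding of the metrics, we survey metric selection in recent shared tasks in Natural Language Processing. The results show that metric choices often lack convincing arguments, which can make any ranking seem arbitrary. This work aims to provide an overview and guidance for more informed and transparent metric selection, promoting meaningful evaluation.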

An In-Depth Study of Classification Evaluation Metrics and a Critical Reflection on Common Evaluation Practice

A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice

We establish THumB, a rubric-based human evaluation protocol for image
captioning models. Our scoring rubrics and their definitions are carefully
developed based on machine- and human-generated captions on the MSCOCO dataset.
Each caption is evaluated along two main dimensions in a tradeoff (precision
and recall) as well as other aspects that measure the text quality (fluency,
conciseness, and inclusive language). Our evaluations demonstrate several
critical problems of the current evaluation practice. Human-generated captions
show substantially higher quality than machine-generated ones, especially in
coverage of salient information (i.e., recall), while most automatic metrics
say the opposite. Our rubric-based results reveal that CLIPScore, a recent
metric that uses image features, better correlates with human judgments than
conventional text-only metrics because it is more sensitive to recall. We hope
that this work will promote a more transparent evaluation protocol for image
captioning and its automatic metrics.

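This paper introduces THumB, a human evaluation protocol for image captioning models, developed on machine- and human-generated captions from the MSCOCO dataset and used to assess caption quality. Our experiments find that CLIPScore, a recent metric that uses image features, agrees better with human judgments.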