Please, Don't Forget the Difference and the Confidence Interval when Seeking for the State-of-the-Art Status

May, 2022

Yves Bestgen

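TL;DR This paper argues for the widespread use of bootstrap confidence intervals, rather than state-of-the-art (SOTA) status and statistical significance tests, for comparing the performance of natural language processing systems. Two case studies illustrate their main advantage: they draw attention to the difference in performance between two systems and help assess how much better one system is than the other. The paper also provides a Python module for obtaining these confidence intervals, together with a second function that implements the Fisher-Pitman test for paired samples; both are freely available on PyPI.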
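To make the idea concrete, the following is a minimal sketch of a percentile bootstrap confidence interval for the difference between two systems scored on the same test items, plus a Monte Carlo sign-flip permutation test on the paired differences, which is one common way to approximate a Fisher-Pitman-style test for paired samples. It is not the API of the module distributed on PyPI: the function names `paired_bootstrap_ci` and `paired_permutation_pvalue`, their parameters, and the toy per-item scores are all assumptions made for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Hypothetical helper: both lists hold per-item scores for the SAME
    test items, so resampling is done on the paired differences.
    Returns (observed difference, lower bound, upper bound).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the differences with replacement and record each mean.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]


def paired_permutation_pvalue(scores_a, scores_b, n_permutations=2000, seed=0):
    """Two-sided Monte Carlo sign-flip permutation test on paired differences.

    Hypothetical helper: under the null hypothesis the sign of each paired
    difference is exchangeable, so signs are flipped at random.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data (invented for illustration): 1 = item answered correctly, 0 = not.
system_a = [1] * 80 + [0] * 20
system_b = [1] * 60 + [0] * 40

diff, lower, upper = paired_bootstrap_ci(system_a, system_b)
p_value = paired_permutation_pvalue(system_a, system_b)
print(f"difference = {diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}], p = {p_value:.4f}")
```

The point of reporting the interval rather than only the p-value is that it shows how large the difference plausibly is, not merely whether it is distinguishable from zero.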

Abstract

This paper argues for the widest possible use of bootstrap confidence intervals for comparing NLP system performances instead of the state-of-the-art status (SOTA) and …