In this paper, we introduce Ranger, a toolkit that facilitates the use of
effect-size-based meta-analysis for multi-task evaluation in NLP and IR. We
observed that our communities often face the challenge of aggregating results
over incomparable metrics and scenarios, which makes conclusions and take-away
messages less reliable. With Ranger, we aim to address this issue by providing
a task-agnostic toolkit that combines the effect of a treatment on multiple
tasks into one statistical evaluation, allowing for comparison of metrics and
computation of an overall summary effect. Our toolkit produces
publication-ready forest plots that enable clear communication of evaluation
results over multiple tasks. Our goal with the ready-to-use Ranger toolkit is
to promote robust, effect-size-based evaluation and improve evaluation
standards in the community. We provide two case studies for common IR and NLP
settings to highlight Ranger's benefits.
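
To make the idea of a summary effect concrete, the following is a minimal sketch of how per-task effect sizes can be pooled with a random-effects model. It is not Ranger's API: the function name `summary_effect`, the choice of the DerSimonian-Laird estimator for between-task variance, the normal-approximation 95% interval, and the example numbers are all assumptions made for illustration.

```python
import math


def summary_effect(effects, variances):
    """Pool per-task effect sizes into one summary effect.

    Hypothetical helper (not part of Ranger). `effects` holds one effect
    size per task (e.g. a standardized mean difference between treatment
    and baseline) and `variances` holds the sampling variance of each.
    Uses the DerSimonian-Laird random-effects estimator.
    """
    k = len(effects)
    # Inverse-variance (fixed-effect) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q measures how much the tasks disagree with each other.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-task variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to every task's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    # 95% confidence interval under a normal approximation.
    return mean, (mean - 1.96 * se, mean + 1.96 * se), tau2


# Made-up effect sizes and variances for two tasks, for illustration only.
mean, ci, tau2 = summary_effect([0.2, 0.4], [0.01, 0.01])
print(f"summary effect = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

In a forest plot, each task would appear as one row with its own effect size and interval, and the pooled value computed here would form the final summary row.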

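This paper introduces the Ranger toolkit, which uses meta-analysis to address the problem of aggregating incomparable metrics in NLP and IR applications, providing a task-agnostic toolkit for statistical evaluation across multiple tasks.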

Ranger: An Effect-Size-Based Multi-Task Evaluation Tool

Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation

We call on the Document AI (DocAI) community to reevaluate current
methodologies and embrace the challenge of creating more practically-oriented
benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to
remedy the stalled research progress in understanding visually-rich documents
(VRDs). We present a new dataset with novelties related to types of questions,
answers, and document layouts based on multi-industry, multi-domain, and
multi-page VRDs of various origins and dates. Moreover, we are pushing the
boundaries of current methods by creating multi-task and multi-domain
evaluation setups that more accurately simulate real-world situations where
powerful generalization and adaptation under low-resource settings are desired.
DUDE aims to set a new standard as a more practical, long-standing benchmark
for the community, and we hope that it will lead to future extensions and
contributions that address real-world challenges. Finally, our work illustrates
the importance of finding more efficient ways to model language, images, and
layout in DocAI.

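This paper calls on the Document AI community to reevaluate current methodologies and take up the challenge of creating more practically meaningful benchmarks. Document Understanding Dataset and Evaluation (DUDE) aims to remedy the stalled research progress in understanding visually-rich documents. We present a new dataset containing a variety of questions, answers, and layouts drawn from multi-industry, multi-domain, and multi-page visually-rich documents. In addition, we push the boundaries of current methods by creating multi-task and multi-domain evaluation setups that more accurately simulate real-world demands for strong generalization and adaptation in low-resource settings. DUDE aims to set a more practical, long-standing benchmark for the community, and we hope it will lead to future extensions and contributions that address real-world challenges. Finally, our work illustrates the importance of finding more efficient ways to model language, images, and layout in Document AI.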