By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On the one hand, this is desirable as it treats all classes, rare to frequent, equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, we find that on imbalanced, large-vocabulary datasets, the default implementation of AP is not category independent and does not directly reward properly calibrated detectors. In fact, we show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin. To address these limitations, we introduce two complementary metrics. First, we present a simple fix to the default AP implementation, ensuring that it is truly independent across categories as originally intended. We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation, suggesting recent improvements may arise from difficult-to-interpret changes to cross-category rankings. Given the importance of reliably benchmarking cross-category rankings, we consider a pooled version of AP (AP-pool) that rewards properly calibrated detectors by directly comparing cross-category rankings. Finally, we revisit classical approaches for calibration and find that explicitly calibrating detectors improves the state of the art on AP-pool by 1.7 points.
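To make the distinction between the two metrics concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes detections have already been matched to ground truth (the `is_tp` flags), uses a simple non-interpolated AP, and all function and variable names (`average_precision`, `per_class_ap`, `pooled_ap`, `gt_counts`) are hypothetical. It illustrates that shifting the scores of one category leaves the per-class average unchanged, while a pooled AP reacts to the change in cross-category ranking.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Non-interpolated AP: sum of precision at each true positive, over num_gt."""
    if num_gt == 0:
        return float("nan")
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_tp, dtype=float)[order]
    precision = np.cumsum(tp) / np.arange(1, len(tp) + 1)
    return float(np.sum(precision * tp) / num_gt)

def per_class_ap(scores, is_tp, labels, gt_counts):
    """Compute AP separately for each category, then average over categories."""
    scores, is_tp, labels = map(np.asarray, (scores, is_tp, labels))
    aps = [
        average_precision(scores[labels == c], is_tp[labels == c], n)
        for c, n in gt_counts.items()
    ]
    return float(np.mean(aps))

def pooled_ap(scores, is_tp, gt_counts):
    """Rank detections from all categories together and compute a single AP."""
    return average_precision(scores, is_tp, sum(gt_counts.values()))

# Toy data (assumed): category 0 is frequent, category 1 is rare.
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
is_tp = np.array([1, 0, 1, 1, 0])
labels = np.array([0, 0, 0, 1, 1])
gt_counts = {0: 2, 1: 1}

# Shift the rare category's scores upward; within-category order is unchanged.
shifted = scores.copy()
shifted[labels == 1] += 0.75

ap_class_before = per_class_ap(scores, is_tp, labels, gt_counts)
ap_class_after = per_class_ap(shifted, is_tp, labels, gt_counts)
ap_pool_before = pooled_ap(scores, is_tp, gt_counts)
ap_pool_after = pooled_ap(shifted, is_tp, gt_counts)

print(ap_class_before, ap_class_after)  # identical: per-class AP ignores the shift
print(ap_pool_before, ap_pool_after)    # different: pooled AP depends on the shift
```

In this idealized setting per-class AP is fully independent of cross-category scores. The abstract's point is that the default implementation does not behave this way on imbalanced, large-vocabulary datasets, which is what the proposed fix restores.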

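This paper proposes two complementary metrics to address shortcomings of the default AP implementation on imbalanced, large-vocabulary datasets. We find that the default implementation is gameable: a simple re-ranking of detections across categories can raise AP by a large margin. Under our category-independent evaluation, many reported gains do not translate into improvements. We also consider a pooled version of AP (AP-pool) that rewards properly calibrated cross-category rankings. Finally, we revisit classical calibration methods and find that explicitly calibrating detectors improves the state of the art on AP-pool by 1.7 points.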

Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details