Artificial Intelligence (AI) is applied across many domains, which raises concerns about fairness in its deployment. The prevailing discourse on fairness emphasizes outcome-based metrics and rarely considers differential impacts within subgroups. Bias mitigation techniques affect not only the ranking of pairs of instances across sensitive groups but often also, and significantly, the ranking of instances within these groups. Such changes are hard to explain and raise concerns about the validity of the intervention. Unfortunately, these effects largely go unnoticed in the accuracy-fairness evaluation framework that is usually applied. This paper challenges the prevailing metrics for assessing bias mitigation techniques, arguing that they do not account for within-group changes and that the resulting prediction labels fall short of reflecting real-world scenarios. We propose a paradigm shift: first, focus on generating the most precise ranking for each subgroup; then, select individuals from these rankings to meet both fairness standards and practical considerations.
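The two-step proposal can be illustrated with a minimal sketch. It assumes each candidate already carries a score from some model and that the fairness standard is expressed as a fixed per-group quota. The function name `select_from_group_rankings`, the quota mechanism, and the toy data are hypothetical choices made for illustration; the abstract does not specify a selection rule.

```python
from collections import defaultdict


def select_from_group_rankings(candidates, quotas):
    """Rank candidates within each group, then pick the top of each ranking.

    candidates: list of (candidate_id, group, score) tuples (assumed format).
    quotas: dict mapping group -> number of individuals to select (assumed
            stand-in for "fairness standards and practical considerations").
    """
    # Step 1: build a separate ranking for every subgroup.
    by_group = defaultdict(list)
    for candidate_id, group, score in candidates:
        by_group[group].append((candidate_id, score))

    selected = {}
    for group, k in quotas.items():
        # Highest score first; the within-group order is never altered
        # by the fairness constraint.
        ranked = sorted(by_group[group], key=lambda item: item[1], reverse=True)
        # Step 2: take the top-k of this group's ranking.
        selected[group] = [candidate_id for candidate_id, _ in ranked[:k]]
    return selected


# Toy data, invented for illustration only.
candidates = [
    ("a1", "A", 0.91), ("a2", "A", 0.62), ("a3", "A", 0.78),
    ("b1", "B", 0.55), ("b2", "B", 0.73), ("b3", "B", 0.40),
]
result = select_from_group_rankings(candidates, {"A": 2, "B": 1})
print(result)
```

Because the quotas only decide how many individuals are taken from each group, changing them never reorders individuals within a group, which is the property the abstract argues current evaluations overlook.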

Beyond Accuracy-Fairness: Stop Evaluating Bias Mitigation Methods Solely on Between-Group Metrics