Debiasing methods that seek to mitigate the tendency of Language Models (LMs) to occasionally output toxic or inappropriate text have recently gained traction. In this paper, we propose a standardized protocol which distinguishes methods that yield not only desirable results, but are also consistent with their mechanisms and specifications. For example, we ask, given a debiasing method that is developed to reduce toxicity in LMs, if the definition of toxicity used by the debiasing method is reversed, would the debiasing results also be reversed? We used such considerations to devise three criteria for our new protocol: Specification Polarity, Specification Importance, and Domain Transferability. As a case study, we apply our protocol to a popular debiasing method, Self-Debiasing, and compare it to one we propose, called Instructive Debiasing, and demonstrate that consistency is as important an aspect to debiasing viability as is simply a desirable result. We show that our protocol provides essential insights into the generalizability and interpretability of debiasing methods that may otherwise go overlooked.

该文提出了一种标准化协议来区分那些不仅产生了可取的结果，而且与它们的机制和规格一致的去偏差方法，并通过提供 essential insights 来展示了该协议对于去偏差方法的普适性和可解释性的重要性。

消除偏见的好与坏：测量语言模型中消除偏见技术的一致性