We present a general approach towards controllable societal biases in natural language generation (NLG). Building upon the idea of adversarial triggers, we develop a method to induce or avoid biases in generated text containing mentions of specified demographic groups. We then analyze two scenarios: 1) inducing biases for one demographic and avoiding biases for another, and 2) mitigating biases between demographic pairs (e.g., man and woman). The former scenario gives us a tool for detecting the types of biases present in the model, and the latter is useful for mitigating biases in downstream applications (e.g., dialogue generation). Specifically, our approach facilitates more explainable biases by allowing us to 1) use the relative effectiveness of inducing biases for different demographics as a new dimension for bias evaluation, and 2) discover topics that correspond to demographic inequalities in generated text. Furthermore, our mitigation experiments exemplify our technique's effectiveness at equalizing the amount of biases across demographics while simultaneously generating less negatively biased text overall.

我们提出了一种通用方法来控制自然语言生成中的社会偏见。通过对特定人口群体进行输入提示的提及，我们开发了一种诱发社会偏见的方法，并对两种情况进行了分析：在一种人口群体中诱发负面偏见，同时在另一种人口群体中诱发正面偏见，并使偏见在不同人口群体之间相等。该方法被证明在减轻偏见过程中是有效的。

语言生成中的可控偏见