Researchers have devised numerous ways to quantify social biases embedded in pretrained language models. Because some language models can generate coherent completions from a set of textual prompts, several prompting datasets have been proposed to measure biases between social groups, posing language generation as a way of identifying biases. In this opinion paper, we analyze how specific choices of prompt sets, metrics, automatic tools, and sampling strategies affect bias results. We find that the practice of measuring biases through text completion is prone to yielding contradictory results under different experimental settings. We additionally provide recommendations for reporting biases in open-ended language generation, to give a more complete picture of the biases exhibited by a given language model. Code to reproduce the results is available at https://github.com/feyzaakyurek/bias-textgen.

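This paper analyzes how specific choices of prompt sets, metrics, automatic tools, and sampling strategies in text completion affect social bias results. It finds that the practice of measuring bias readily yields contradictory results under different experimental settings, and it offers recommendations for reporting bias in open-ended language generation so as to give a more complete picture of the biases a given language model exhibits.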

Challenges in Measuring Bias via Open-Ended Language Generation