As Large Language Models (LLMs) become an important means of information
seeking, there have been increasing concerns about the unethical content LLMs
may generate. In this paper, we conduct a rigorous evaluation of LLMs' implicit
bias towards certain groups by attacking them with carefully crafted
instructions to elicit biased responses. Our attack methodology is inspired by
psychometric principles in cognitive and social psychology. We propose three
attack approaches, i.e., Disguise, Deception, and Teaching, based on which we
build evaluation datasets for four common bias types. Each prompt attack has
bilingual versions. Extensive evaluation of representative LLMs shows that 1)
all three attack methods work effectively, especially the Deception attacks; 2)
GLM-3 performs the best in defending against our attacks, compared to GPT-3.5
and GPT-4; 3) LLMs can output content of other bias types when taught with
one type of bias. Our methodology provides a rigorous and effective way of
evaluating LLMs' implicit bias and will benefit assessments of LLMs'
potential ethical risks.

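The spread of Large Language Models (LLMs) has raised growing concern about the unethical content they may generate. In this paper, we evaluate LLMs' implicit bias towards certain groups by attacking them with carefully crafted instructions. We propose three attack approaches (Disguise, Deception, and Teaching) and build evaluation datasets for four common bias types. Extensive evaluation of representative LLMs shows that 1) all three attack methods are highly effective, especially the Deception attacks; 2) GLM-3 performs the best in defending against our attacks, while GPT-3.5 and GPT-4 perform worse; 3) when taught with one type of bias, LLMs may output content of other bias types. Our methodology provides a reliable and effective way of evaluating LLMs' implicit bias and helps assess LLMs' potential ethical risks.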