Large language models (LLMs) have become integral to our professional
workflows and daily lives. Nevertheless, these machine companions have a
critical flaw: the huge amount of data that endows them with vast and diverse
knowledge also exposes them to inevitable toxicity and bias. While
most LLMs incorporate defense mechanisms to prevent the generation of harmful
content, these safeguards can be easily bypassed with minimal prompt
engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity
(TET) dataset, comprising manually crafted prompts designed to nullify the
protective layers of such models. Through extensive evaluations, we demonstrate
the pivotal role of TET in providing a rigorous benchmark for evaluating
toxicity awareness in several popular LLMs: it exposes toxicity that might
remain hidden under normal prompts, thus revealing subtler issues in the
models' behavior.

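This study introduces the new Thoroughly Engineered Toxicity (TET) dataset, composed of manually crafted prompts designed to nullify the protective layers of these models. Through extensive evaluation, it demonstrates the important role of TET in assessing toxicity awareness in several popular LLMs, highlighting toxicity that may remain hidden under normal prompts and thereby revealing subtler issues in their behavior.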