Large language models produce human-like text that drives a growing number of
applications. However, recent literature and, increasingly, real-world
observations have demonstrated that these models can generate language that is
toxic, biased, untruthful or otherwise harmful. Though work to evaluate
language model harms is under way, translating foresight about which harms may
arise into rigorous benchmarks is not straightforward. To facilitate this
translation, we outline six ways of characterizing harmful text which merit
explicit consideration when designing new benchmarks. We then use these
characteristics as a lens to identify trends and gaps in existing benchmarks.
Finally, we apply them in a case study of the Perspective API, a toxicity
classifier that is widely used in harm benchmarks. Our characteristics provide
one piece of the bridge that translates between foresight and effective
evaluation.

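Text generated by large language models appears human-like in a growing number of applications, but recent literature and real-world observations show that these models can generate toxic, biased, untruthful, or otherwise harmful language. This paper proposes six ways of characterizing harmful text and applies them to existing benchmarks and a case study, offering an effective approach to evaluating harmful text.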

Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models

Text data can pose a risk of harm. However, the risks are not fully
understood, and how to handle, present, and discuss harmful text in a safe way
remains an unresolved issue in the NLP community. We provide an analytical
framework categorising harms on three axes: (1) the harm type (e.g.,
misinformation, hate speech or racial stereotypes); (2) whether a harm is
\textit{sought} as a feature of the research design if explicitly studying
harmful content (e.g., training a hate speech classifier), versus
\textit{unsought} if harmful content is encountered when working on unrelated
problems (e.g., language generation or part-of-speech tagging); and (3) who it
affects, from people (mis)represented in the data to those handling the data
and those publishing on the data. We provide advice for practitioners, with
concrete steps for mitigating harm in research and in publication. To assist
implementation, we introduce \textsc{HarmCheck} -- a documentation standard for
handling and presenting harmful text in research.

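This paper presents an analytical framework that categorises harmful text in NLP along three axes, offers advice on handling and presenting harmful text, and introduces a documentation standard for doing so.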

Handling and Presenting Harmful Text in NLP Research

Language models trained on large-scale unfiltered datasets curated from the
open web acquire systemic biases, prejudices, and harmful views from their
training data. We present a methodology for programmatically identifying and
removing harmful text from web-scale datasets. A pretrained language model is
used to calculate the log-likelihood of researcher-written trigger phrases
conditioned on a specific document, and this score is used to identify and
filter documents from the dataset. We demonstrate that models trained on this
filtered dataset exhibit a lower propensity to generate harmful text, with a marginal
decrease in performance on standard language modeling benchmarks compared to
unfiltered baselines. We provide a partial explanation for this performance gap
by surfacing examples of hate speech and other undesirable content from
standard language modeling benchmarks. Finally, we discuss the generalization
of this method and how trigger phrases which reflect specific values can be
used by researchers to build language models which are more closely aligned
with their values.

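This paper proposes a method for identifying and filtering harmful text from web-scale datasets. A pretrained language model computes the log-likelihood of researcher-written trigger phrases conditioned on a specific document, and the result is used to identify and filter documents from the dataset. Language models trained on the filtered dataset show a lower propensity to generate harmful text, with slightly lower performance than unfiltered baselines. Finally, the paper discusses how the method generalizes and how it can help align language models with researchers' values.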
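
The filtering idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the pretrained language model is replaced by a toy add-alpha smoothed unigram model fitted on each document, and the names `toy_conditional_log_likelihood`, `filter_documents`, `vocab_size`, `alpha`, and `threshold`, as well as the choice to take the maximum score over trigger phrases, are all assumptions made for illustration. In practice the scorer would be a call to a real pretrained model that returns log P(trigger phrase | document).

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (document, phrase) -> log P(phrase | document).
Scorer = Callable[[str, str], float]


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer; a real system would use the LM's tokenizer."""
    return text.lower().split()


def toy_conditional_log_likelihood(
    document: str, phrase: str, vocab_size: int = 1000, alpha: float = 1.0
) -> float:
    """Toy stand-in for a pretrained LM (an assumption, not the paper's model):
    an add-alpha smoothed unigram model estimated from the document, used to
    sum the log-probabilities of the tokens in the trigger phrase."""
    counts = Counter(tokenize(document))
    total = sum(counts.values())
    denom = total + alpha * vocab_size
    return sum(math.log((counts[tok] + alpha) / denom) for tok in tokenize(phrase))


def filter_documents(
    documents: Sequence[str],
    trigger_phrases: Sequence[str],
    threshold: float,
    scorer: Scorer = toy_conditional_log_likelihood,
) -> List[str]:
    """Keep documents whose highest trigger-phrase log-likelihood stays below
    the threshold. Both the threshold and the max aggregation are assumptions."""
    kept = []
    for doc in documents:
        highest = max(scorer(doc, phrase) for phrase in trigger_phrases)
        if highest < threshold:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        "the weather is mild and sunny today",
        "i hate those people they are awful",
    ]
    print(filter_documents(docs, ["i hate those people"], threshold=-26.0))
```

A document that makes a trigger phrase more likely receives a higher score and is removed. The toy scorer is sensitive to document length, so a real pipeline would likely normalize scores or calibrate the threshold on held-out data; that detail is not specified in the abstract.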