Social biases can manifest in language agency. For instance, White
individuals and men are often described as "agentic" and achievement-oriented,
whereas Black individuals and women are frequently described as "communal" and
as occupying assisting roles. This study establishes agency as an important aspect of
studying social biases in both human-written and Large Language Model
(LLM)-generated texts. To accurately measure "language agency" at the sentence
level, we propose a Language Agency Classification dataset to train reliable
agency classifiers. We then use an agency classifier to reveal notable language
agency biases in 6 datasets of human- or LLM-written texts, including
biographies, professor reviews, and reference letters. While most prior NLP
research on agency biases focused on single dimensions, we comprehensively
explore language agency biases in gender, race, and intersectional identities.
We observe that (1) language agency biases in human-written texts align with
real-world social observations; (2) LLM-generated texts demonstrate remarkably
higher levels of language agency bias than human-written texts; and (3)
critical biases in language agency target people of minority groups; for
instance, language used to describe Black females exhibits the lowest level of
agency across datasets. Our findings reveal intricate social biases in human-
and LLM-written texts through the lens of language agency, warning against
using LLM generations in social contexts without scrutiny.

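Summary: This work studies social bias through the lens of language agency, examining biases in both human-written and Large Language Model (LLM)-generated texts, and uses a validated dataset and classifier to reveal language agency biases across domains. The results show that, across gender, race, and intersectional identities, human-written texts exhibit language agency biases consistent with real-world social observations; that LLM-generated texts show more pronounced language agency bias than human-written texts; and that biases targeting minority groups are especially severe. For example, language describing Black women shows the lowest level of agency across datasets. The study thus reveals intricate social biases in human- and LLM-generated texts through the lens of language agency, and cautions against using LLM-generated text in social contexts without scrutiny.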

White Men Lead, Black Women Help: Uncovering Gender, Racial, and Intersectional Bias in Language Agency
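
As a rough illustration of how sentence-level agency labels could be turned into a group-level comparison, the sketch below computes the share of sentences labeled "agentic" per demographic group. The function name, the label strings, and the use of a simple share are assumptions made for illustration; the abstract does not specify the exact metric the authors use.

```python
from collections import defaultdict

def agentic_share_by_group(labeled_sentences):
    """Return {group: fraction of sentences labeled "agentic"}.

    labeled_sentences: iterable of (group, label) pairs, where label is
    "agentic" or "communal" (hypothetical label names, assumed here).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [agentic count, total count]
    for group, label in labeled_sentences:
        counts[group][0] += int(label == "agentic")
        counts[group][1] += 1
    return {group: agentic / total for group, (agentic, total) in counts.items()}

# Toy, made-up labels purely to show the call shape.
toy = [
    ("group_a", "agentic"), ("group_a", "agentic"), ("group_a", "communal"),
    ("group_b", "communal"), ("group_b", "agentic"),
]
shares = agentic_share_by_group(toy)
gap = shares["group_a"] - shares["group_b"]  # positive gap: group_a described as more agentic
```

In practice the labels would come from a trained agency classifier applied to each sentence; the toy data above is invented only to show the input format.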

This paper investigates the radioactivity of LLM-generated texts, i.e.
whether it is possible to detect that such input was used as training data.
Conventional methods like membership inference can carry out this detection
with some level of accuracy. We show that watermarked training data leaves
traces that are easier to detect, and far more reliably so, than membership
inference allows. We link the contamination level to the robustness of the
watermark, the proportion of watermarked text in the training set, and the
fine-tuning process. We notably demonstrate that training
on watermarked synthetic instructions can be detected with high confidence
(p-value < 1e-5) even when as little as 5% of training text is watermarked.
Thus, LLM watermarking, originally designed for detecting machine-generated
text, gives the ability to easily identify if the outputs of a watermarked LLM
were used to fine-tune another LLM.

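Summary: This work investigates the radioactivity of LLM-generated texts, i.e., whether it is possible to detect that such text was used as training data. Compared with conventional methods such as membership inference, watermarked training data leaves traces that are easier to detect and more reliable. The contamination level is linked to the robustness of the watermark, its proportion in the training set, and the fine-tuning process. Notably, training on watermarked synthetic instructions can be detected with high confidence (p-value < 1e-5) even when only 5% of the training text is watermarked. LLM watermarking, originally designed to detect machine-generated text, can therefore readily identify whether the outputs of a watermarked LLM were used for fine-tuning.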

Watermarking Makes Language Models Radioactive
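
To make the reported p-value concrete, here is a minimal sketch of one way a watermark detection test can be scored. It assumes a "green-list" style watermark in which, without any watermark, each scored token falls in a secret green subset with probability `gamma`; the abstract does not state which watermark or test the authors use, so the scheme, the function name, and the parameters are all illustrative assumptions.

```python
import math

def binomial_tail_p_value(green_count, total, gamma):
    """P(X >= green_count) for X ~ Binomial(total, gamma).

    Under the null hypothesis (no watermark trace), each scored token is
    assumed to be "green" independently with probability gamma.
    """
    if green_count <= 0:
        return 1.0
    log_terms = []
    for i in range(green_count, total + 1):
        # log of C(total, i) * gamma^i * (1 - gamma)^(total - i), in log space
        log_comb = (math.lgamma(total + 1) - math.lgamma(i + 1)
                    - math.lgamma(total - i + 1))
        log_terms.append(log_comb + i * math.log(gamma)
                         + (total - i) * math.log1p(-gamma))
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))

# Hypothetical counts: 600 green tokens out of 1000 scored, gamma = 0.5.
p = binomial_tail_p_value(600, 1000, 0.5)
is_radioactive = p < 1e-5  # threshold taken from the abstract
```

A small p-value means the model under test favors green tokens far more often than chance would explain, which is the kind of trace that training on watermarked text would be expected to leave.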

Large Language Models (LLMs) have achieved human-level fluency in text
generation, making it difficult to distinguish between human-written and
LLM-generated texts. This poses a growing risk of misuse of LLMs and demands
the development of detectors to identify LLM-generated texts. However, the
accuracy of existing detectors degrades when LLM-generated texts are simply
paraphrased. Furthermore, the effectiveness of these detectors in real-life
situations, such as when students use LLMs for writing homework assignments
(e.g., essays) and quickly learn how to evade these detectors, has not been
explored. In this paper, we propose OUTFOX, a novel framework that improves the
robustness of LLM-generated-text detectors by allowing both the detector and
the attacker to consider each other's output, and we apply it to the domain of
student essays. In our framework, the attacker uses the detector's prediction
labels as examples for in-context learning and adversarially generates essays
that are harder to detect, while the detector uses the adversarially generated
essays as examples for in-context learning to learn to detect essays from a
strong attacker. Our experiments show that our proposed detector, which learns
in-context from the attacker, improves detection performance on the attacked
dataset by up to +41.3 F1 points, while our proposed attacker can
drastically degrade the performance of the detector, by up to -57.0 F1 points
compared to the paraphrasing method.

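Summary: This paper proposes OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing the detector and the attacker to consider each other's output, and applies it to the domain of student essays.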
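
The mutual in-context learning described in the abstract can be sketched as two prompt builders and one round of interaction. This is only an illustration of the idea: the prompt wording, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Labeled = Tuple[str, str]  # (essay text, label), where label is "Human" or "LM"

def build_attacker_prompt(detector_labeled: List[Labeled], problem: str) -> str:
    """Show essays with the detector's predicted labels, then request a new essay."""
    shots = [f"Essay: {text}\nDetector says: {label}" for text, label in detector_labeled]
    task = f"Problem: {problem}\nWrite an essay the detector would label Human.\nEssay:"
    return "\n\n".join(shots + [task])

def build_detector_prompt(adversarial: List[str], human: List[str], candidate: str) -> str:
    """Use adversarially generated essays (labeled LM) and human essays as examples."""
    shots = [f"Essay: {text}\nAnswer: LM" for text in adversarial]
    shots += [f"Essay: {text}\nAnswer: Human" for text in human]
    return "\n\n".join(shots + [f"Essay: {candidate}\nAnswer:"])

def detect_with_attacker_examples(llm: Callable[[str], str], problem: str,
                                  detector_labeled: List[Labeled],
                                  human: List[str], candidate: str) -> str:
    """One round: the attacker writes an essay, then the detector learns from it in context."""
    attack_essay = llm(build_attacker_prompt(detector_labeled, problem))
    answer = llm(build_detector_prompt([attack_essay], human, candidate))
    return answer.strip()
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; a real setup would also need to choose how many in-context examples to retrieve and how to select them.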