The remarkable capabilities and easy accessibility of large language models
(LLMs) have significantly increased societal risks (e.g., fake news
generation), necessitating the development of LLM-generated text (LGT)
detection methods for safe usage. However, detecting LGTs is challenging due to
the vast number of LLMs, making it impractical to account for each LLM
individually; hence, it is crucial to identify the common characteristics
shared by these models. In this paper, we draw attention to a common feature of
recent powerful LLMs, namely alignment training, i.e., training LLMs to
generate human-preferable texts. Our key finding is that, because these aligned
LLMs are trained to maximize human preferences, they generate texts with even
higher estimated preferences than human-written texts; thus, such texts are
easily detected using a reward model (i.e., an LLM trained to model the human
preference distribution). Based on this finding, we propose two training
schemes to further improve the detection ability of the reward model, namely
(i) continual preference fine-tuning to make the reward model prefer aligned
LGTs even further and (ii) reward modeling of Human/LLM mixed texts (texts
rephrased from human-written texts using aligned LLMs), which serve as a
median-preference text corpus between LGTs and human-written texts to better
learn the decision boundary. We provide an extensive evaluation by considering six
text domains across twelve aligned LLMs, where our method demonstrates
state-of-the-art results. Code is available at
this https URL
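
The detection idea described above (aligned LLM outputs receive higher reward-model scores than human-written text) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the scores are made-up placeholders rather than outputs of any real reward model or results from the paper, and `auroc` and `is_llm_generated` are hypothetical helper names.

```python
from typing import Sequence


def auroc(human_scores: Sequence[float], llm_scores: Sequence[float]) -> float:
    """Probability that a random LLM text outscores a random human text.

    Ties count as 0.5. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    wins = 0.0
    for llm_score in llm_scores:
        for human_score in human_scores:
            if llm_score > human_score:
                wins += 1.0
            elif llm_score == human_score:
                wins += 0.5
    return wins / (len(llm_scores) * len(human_scores))


def is_llm_generated(reward_score: float, threshold: float) -> bool:
    """Flag a text as LLM-generated when its reward score exceeds the threshold."""
    return reward_score > threshold


# Hypothetical reward scores, for illustration only.
human = [-1.2, 0.3, 0.1, -0.4]
llm = [1.5, 0.9, 0.2, 2.1]

print("AUROC:", auroc(human, llm))
print("flagged:", [is_llm_generated(s, threshold=0.5) for s in human + llm])
```

In practice the scores would come from a trained reward model, and the threshold would be chosen on held-out data; neither is specified in the abstract.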

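Building on the alignment training of large language models and the detection ability of reward models, this paper proposes two training schemes to improve the detection of LLM-generated text (LGT), and provides an extensive evaluation on twelve aligned models across six text domains, demonstrating state-of-the-art results.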

ReMoDetect: Reward Models Recognize Aligned LLM's Generations

In this paper, we study the problem of watermarking large language models
(LLMs). We consider the trade-off between model distortion and detection
ability and formulate it as a constrained optimization problem based on the
green-red algorithm of Kirchenbauer et al. (2023a). We show that the optimal
solution to the optimization problem enjoys a nice analytical property which
provides a better understanding and inspires the algorithm design for the
watermarking process. We develop an online dual gradient ascent watermarking
algorithm in light of this optimization formulation and prove its asymptotic
Pareto optimality between model distortion and detection ability. Such a result
explicitly guarantees an increased average green-list probability, and hence
detection ability (in contrast to previous results). Moreover, we
provide a systematic discussion on the choice of the model distortion metrics
for the watermarking problem. We justify our choice of KL divergence and
present issues with the existing criteria of "distortion-free" and
perplexity. Finally, we empirically evaluate our algorithms on extensive
datasets against benchmark algorithms.
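
As a rough illustration of how an online dual gradient ascent update might interact with a green-red watermark, consider the sketch below. It rests on assumptions not stated in the abstract: the watermark adds a bias `delta` to the logits of green-list tokens (the soft green-red scheme), `delta` plays the role of the dual variable, and the constraint is a target average increase in green-list probability (`target_gain`). The toy random next-token distributions and all function names are hypothetical; this is not the authors' algorithm as published.

```python
import math
import random


def tilt(probs, green, delta):
    """Soft green-red watermark: add delta to the logits of green tokens."""
    weights = [p * math.exp(delta) if g else p for p, g in zip(probs, green)]
    total = sum(weights)
    return [w / total for w in weights]


def green_mass(probs, green):
    """Total probability assigned to green-list tokens."""
    return sum(p for p, g in zip(probs, green) if g)


def dual_ascent(steps, target_gain, lr, vocab=8, seed=0):
    """Adapt delta online so the average green-probability gain tracks target_gain."""
    rng = random.Random(seed)
    delta = 0.0  # dual variable, kept non-negative
    gains = []
    for _ in range(steps):
        # Toy stand-in for the model's next-token distribution.
        raw = [rng.random() for _ in range(vocab)]
        total = sum(raw)
        probs = [r / total for r in raw]
        # Toy stand-in for a pseudo-random green/red vocabulary split.
        green = [rng.random() < 0.5 for _ in range(vocab)]

        watermarked = tilt(probs, green, delta)
        gain = green_mass(watermarked, green) - green_mass(probs, green)
        gains.append(gain)

        # Dual gradient ascent step: raise delta when the gain falls short
        # of the target, lower it (down to zero) when the gain exceeds it.
        delta = max(0.0, delta + lr * (target_gain - gain))
    return delta, sum(gains) / len(gains)


delta, avg_gain = dual_ascent(steps=2000, target_gain=0.1, lr=0.5)
print("final delta:", delta)
print("average green-probability gain:", avg_gain)
```

Because the bias is adapted per step rather than fixed, tokens with flat distributions need only a small `delta` to gain green probability, whereas a fixed bias distorts every step equally; this is one way to read the distortion versus detection trade-off described above.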

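This paper studies the problem of watermarking large language models (LLMs) and casts the trade-off between model distortion and detection ability as a constrained optimization problem based on the green-red algorithm of Kirchenbauer et al. (2023a). The optimal solution is shown to have a nice analytical property, which gives a better understanding of the watermarking process and inspires its algorithm design. Building on this formulation, the authors develop an online dual gradient ascent watermarking algorithm and prove its asymptotic Pareto optimality between model distortion and detection ability. This result explicitly guarantees an increased average green-list probability, and hence detection ability (in contrast to previous results). The paper also systematically discusses the choice of model distortion metric, justifies the choice of KL divergence, and points out issues with the existing "distortion-free" and perplexity criteria. Finally, the algorithms are evaluated empirically on extensive datasets against benchmark algorithms.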