Large Language Models (LLMs) have rapidly gained enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the inclusion of copyrighted materials in their training data without proper attribution or licensing, which falls under the broader issue of data misappropriation. In this article, we focus on a specific problem in data misappropriation detection: determining whether a given LLM has incorporated data generated by another LLM. To address this issue, we propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct a pivotal statistic, determine the optimal rejection threshold, and explicitly control the type I and type II errors. Furthermore, we establish the asymptotic optimality properties of the proposed tests and demonstrate their empirical effectiveness through extensive numerical experiments.
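
To make the hypothesis testing formulation concrete, the following is a minimal sketch of a threshold test built on a pivotal statistic. It is a generic illustration under stated assumptions, not the statistic or threshold derived in the paper: it assumes that each pivotal statistic is i.i.d. Uniform(0, 1) under the null hypothesis (no watermarked data was used) and is skewed toward 1 under the alternative. The function names `score`, `null_threshold`, and `detect`, the sum-of-logs score, and the toy alternative are all hypothetical choices made for illustration.

```python
import math
import random


def score(pivots):
    """Sum-based test statistic: each term -log(1 - y) is Exp(1) under H0."""
    return sum(-math.log(1.0 - y) for y in pivots)


def null_threshold(n, alpha, n_sim=2000, seed=0):
    """Monte Carlo estimate of the (1 - alpha) quantile of score() under H0."""
    rng = random.Random(seed)
    sims = sorted(score([rng.random() for _ in range(n)]) for _ in range(n_sim))
    return sims[min(n_sim - 1, math.ceil((1.0 - alpha) * n_sim))]


def detect(pivots, threshold):
    """Reject H0 (no watermarked data used) when the score exceeds the threshold."""
    return score(pivots) > threshold


rng = random.Random(1)
n, alpha = 200, 0.05
threshold = null_threshold(n, alpha)

# Toy alternative (an assumption): pivots skewed toward 1, here the max of two uniforms.
watermarked = [max(rng.random(), rng.random()) for _ in range(n)]
print(detect(watermarked, threshold))
```

In this sketch the threshold is calibrated so that the type I error is approximately `alpha`; the type II error then depends on how strongly the alternative distribution departs from uniform and on the number of tokens `n`.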

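This paper addresses the problem of data misappropriation in the training of large language models (LLMs) and proposes a detection framework based on embedding watermarks into copyrighted training data. The study constructs a statistical testing framework, optimizes the rejection threshold, and controls the type I and type II errors, thereby validating the effectiveness of the method in practical applications. The work has significant value for privacy protection and legal compliance.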

A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models