Pre-training, which utilizes extensive and varied datasets, is a critical
factor in the success of Large Language Models (LLMs) across numerous
applications. However, the detailed makeup of these datasets is often not
disclosed, leading to concerns about data security and potential misuse. This
is particularly relevant when copyrighted material, still under legal
protection, is used inappropriately, either intentionally or unintentionally,
infringing on the rights of the authors.
In this paper, we introduce a detailed framework designed to detect and
assess the presence of content from potentially copyrighted books within the
training datasets of LLMs. This framework also provides a confidence estimation
for the likelihood of each content sample's inclusion. To validate our
approach, we conduct a series of simulated experiments, the results of which
affirm the framework's effectiveness in identifying and addressing instances of
content misuse in LLM training processes. Furthermore, we investigate the
presence of recognizable quotes from famous literary works within these
datasets. The outcomes of our study have significant implications for ensuring
the ethical use of copyrighted materials in the development of LLMs,
highlighting the need for more transparent and responsible data management
practices in this field.

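This work introduces a detailed framework for detecting and assessing potentially copyrighted book content in the training datasets of large language models, and provides a confidence estimate for the inclusion of each content sample. Simulated experiments confirm the framework's effectiveness in identifying and addressing content misuse during language model training, and the study also examines the presence of recognizable quotes from famous literary works in these datasets. The findings matter for ensuring the appropriate use of copyrighted material in language model development, and they highlight the need for more transparent and responsible data management practices in the field.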

Digger: Detection of Infringing Content in Large Language Model Training

Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

Generative AI is on the rise, enabling everyone to produce realistic content
via publicly available interfaces. Especially for guided image generation,
diffusion models are changing the creator economy by producing high-quality,
low-cost content. In parallel, artists are rising against unruly AI, since their
artworks are leveraged, distributed, and dissimulated by large generative
models. Our approach, My Art My Choice (MAMC), aims to empower content owners
by protecting their copyrighted materials from being utilized by diffusion
models in an adversarial fashion. MAMC learns to generate adversarially
perturbed "protected" versions of images which can in turn "break" diffusion
models. The perturbation amount is decided by the artist to balance distortion
vs. protection of the content. MAMC is designed with a simple UNet-based
generator, attacking black box diffusion models, combining several losses to
create adversarial twins of the original artwork. We experiment on three
datasets for various image-to-image tasks, with different user control values.
Both protected image and diffusion output results are evaluated in visual,
noise, structure, pixel, and generative spaces to validate our claims. We
believe that MAMC is a crucial step toward preserving ownership information for
AI-generated content in a flawless, need-based, and human-centric way.
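
The trade-off between distortion and protection described above can be illustrated with a minimal sketch. It is not the MAMC loss: the abstract does not list the individual loss terms, so the mean-squared-error terms, the `balance` parameter, and the function names below are all hypothetical stand-ins. Images and diffusion outputs are represented as flat lists of pixel values.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def protection_objective(
    original: Sequence[float],
    protected: Sequence[float],
    output_original: Sequence[float],
    output_protected: Sequence[float],
    balance: float,
) -> float:
    """Hypothetical objective to minimize (an assumption, not the paper's loss).

    balance close to 1 favors low distortion of the protected image;
    balance close to 0 favors disrupting the diffusion model's output.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")
    # How far the protected image drifts from the original artwork.
    distortion = mse(original, protected)
    # How far the model's output on the protected image drifts from
    # its output on the original; larger means stronger protection.
    disruption = mse(output_original, output_protected)
    return balance * distortion - (1.0 - balance) * disruption
```

In this sketch, the artist-chosen `balance` plays the role of the user control value: a generator trained to minimize the objective would accept more visible perturbation as `balance` decreases.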

The My Art My Choice (MAMC) method uses a UNet-based generator to adversarially counter diffusion models, protecting copyrighted images from unauthorized use.

My Art My Choice: Protection Against Unruly AI

My Art My Choice: Adversarial Protection Against Unruly AI

In this work, we carry out a data archaeology to infer books that are known
to ChatGPT and GPT-4 using a name cloze membership inference query. We find
that OpenAI models have memorized a wide collection of copyrighted materials,
and that the degree of memorization is tied to the frequency with which
passages of those books appear on the web. The ability of these models to
memorize an unknown set of books complicates assessments of measurement
validity for cultural analytics by contaminating test data; we show that models
perform much better on memorized books than on non-memorized books for
downstream tasks. We argue that this supports a case for open models whose
training data is known.
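
A name cloze query, as used above, masks a single proper name in a passage and asks the model to fill it in; a book's score is the fraction of passages where the model recovers the name. The following is a minimal sketch under stated assumptions: the `[MASK]` token, the function names, and the `predict` callable (a stand-in for a call to a language model) are hypothetical and not taken from the paper.

```python
from typing import Callable, Iterable, Tuple

MASK = "[MASK]"  # assumed placeholder token


def make_cloze(passage: str, name: str) -> str:
    """Mask the single occurrence of `name` in `passage`."""
    if passage.count(name) != 1:
        raise ValueError("passage must contain the name exactly once")
    return passage.replace(name, MASK)


def name_cloze_accuracy(
    examples: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of masked passages for which `predict` returns the true name.

    `examples` holds (passage, name) pairs from one book; `predict`
    receives the masked passage and returns the model's guessed name.
    """
    total = 0
    correct = 0
    for passage, name in examples:
        guess = predict(make_cloze(passage, name))
        # Case-insensitive exact match on the guessed name.
        correct += int(guess.strip().lower() == name.lower())
        total += 1
    return correct / total if total else 0.0
```

Because a passage with a masked name gives few contextual clues to a reader who has not seen the text, a high accuracy on a given book is read as evidence that the model has memorized it.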

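Through data archaeology, we find that OpenAI models have memorized a large amount of copyrighted material, and that the degree of memorization correlates with how often these books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity in cultural analytics; our study shows that the models perform better on memorized books than on non-memorized ones, which supports the case for open models.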