BriefGPT.xyz
Jun, 2024
LLM Dataset Inference: Did you train on my dataset?
Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic
TL;DR
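The widespread real-world use of large language models has led to copyright disputes over companies training models on internet data without permission. This paper proposes a new dataset inference method that accurately identifies the datasets used to train a large language model; it successfully distinguishes the train and test sets of different subsets of the Pile with no false positives.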
Abstract
The proliferation of large language models (LLMs) in the real world has come with a rise in copyright cases against companies for training their models on unlicensed data from the internet. Recent works have pres