In light of recent plagiarism allegations brought by publishers, newspapers, and other creators of copyrighted corpora against large language model (LLM) developers, we propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of an LLM. Unlike current methods, our approach uses Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and an LLM continuation of that document. These graphs are then compared with respect to content, using cosine similarity, and with respect to structure, using a normalized version of graph edit distance that indicates the degree of isomorphism. Unlike traditional systems that focus on content matching and keyword identification between a source and a target corpus, our approach focuses on the relationships between ideas and how they are organized with regard to one another, which enables a broader and thus more accurate comparison between a source document and an LLM continuation. Additionally, our approach requires neither access to LLM metrics such as perplexity, which may be unavailable in closed "black-box" LLM systems, nor access to the training corpus. A prototype of our system can be found in a hyperlinked GitHub repository.
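As a rough illustration of the two comparisons described above, the following minimal sketch scores two small sets of triples for content and for structure. It is not the prototype's implementation. The function names, the bag-of-words vectorization, and the example triples are all assumptions made for illustration. The structural score is a simplified, label-aligned edit count (nodes and edges are matched only when their labels are identical), swapped in for a true graph edit distance, which is far more expensive to compute exactly.

```python
import math
from collections import Counter


def term_vector(triples):
    """Bag-of-words counts over every subject, predicate, and object (assumed vectorization)."""
    counts = Counter()
    for subject, predicate, obj in triples:
        for part in (subject, predicate, obj):
            counts.update(part.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def graph_parts(triples):
    """Split triples into a node set and a labeled, directed edge set."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    edges = set(triples)
    return nodes, edges


def normalized_edit_distance(triples_a, triples_b):
    """Label-aligned edit count divided by the worst case (no shared nodes or edges).

    Simplification: this is NOT a general graph edit distance, because it never
    tries to match nodes whose labels differ. 0.0 means identical graphs and
    1.0 means nothing in common.
    """
    nodes_a, edges_a = graph_parts(triples_a)
    nodes_b, edges_b = graph_parts(triples_b)
    edits = len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)
    worst = len(nodes_a) + len(nodes_b) + len(edges_a) + len(edges_b)
    return edits / worst if worst else 0.0


# Hypothetical triples standing in for a source document and an LLM continuation.
source = [("llm", "trained on", "corpus"), ("corpus", "owned by", "publisher")]
continuation = [("llm", "trained on", "corpus"), ("corpus", "contains", "articles")]

content_score = cosine_similarity(term_vector(source), term_vector(continuation))
structure_score = normalized_edit_distance(source, continuation)
```

In this sketch a high content score combined with a low structural distance would suggest that the continuation tracks the source closely. Any decision threshold would have to be chosen empirically.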

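We propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of a large language model. Unlike existing methods, we use Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and a large language model continuation of it. These graphs are analyzed for content using cosine similarity and for structure using a normalized version of graph edit distance, which shows the degree of isomorphism. Additionally, our method requires neither access to LLM metrics such as perplexity, which may be unavailable in closed "black-box" large language modeling systems, nor access to the training corpus. A prototype of our system can be found in a hyperlinked GitHub repository.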

Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison