Neural language models (LMs) are vulnerable to training data extraction
attacks due to data memorization. This paper introduces a novel attack scenario
wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the
exposure of the original training data. This strategy differs from prior
studies by aiming to intensify the LM's retention of its pre-training dataset.
To achieve this, the attacker needs to collect generated texts that are closely
aligned with the pre-training data. However, without knowledge of the actual
dataset, quantifying the amount of pre-training data within generated texts is
challenging. To address this, we propose the use of pseudo-labels for these
generated texts, leveraging membership approximations indicated by
machine-generated probabilities from the target LM. We subsequently fine-tune
the LM to favor generations with higher likelihoods of originating from the
pre-training data, based on their membership probabilities. Our empirical
findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a
four- to eight-fold increase in training data exposure. We discuss potential
mitigations and suggest future research directions.
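
The pseudo-labeling step can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring function (mean token log-probability under the target LM), the pairing rule, the margin, and every name below are assumptions chosen for illustration. Each generated text receives a pseudo-membership score, and texts are paired so that the higher-scoring text becomes the preferred example for a preference-style fine-tuning step.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass
class Generation:
    text: str
    token_logprobs: Sequence[float]  # per-token log-probabilities from the target LM


def membership_score(gen: Generation) -> float:
    """Assumed membership proxy: mean token log-probability (higher = more member-like)."""
    if not gen.token_logprobs:
        raise ValueError("generation has no tokens")
    return sum(gen.token_logprobs) / len(gen.token_logprobs)


def build_preference_pairs(
    gens: List[Generation], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) text pairs; chosen has the higher pseudo-membership score."""
    pairs = []
    for a, b in combinations(gens, 2):
        score_a, score_b = membership_score(a), membership_score(b)
        if abs(score_a - score_b) <= margin:
            continue  # pseudo-labels too close to be informative
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


# Toy generations with made-up log-probabilities (hypothetical data).
gens = [
    Generation("text A", [-0.1, -0.2, -0.3]),
    Generation("text B", [-2.0, -1.5, -2.5]),
    Generation("text C", [-0.9, -1.1]),
]
pairs = build_preference_pairs(gens, margin=0.1)
```

The resulting pairs could then be passed to any preference-based fine-tuning procedure; which procedure to use is left open here.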

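This paper introduces a new attack scenario in which neural language models are adversarially fine-tuned to strengthen their retention of pre-training data. Using pseudo-labels that approximate the membership of generated texts, we show that fine-tuning toward generations with higher membership probabilities increases the model's training data exposure four- to eight-fold.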

Enhancing Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships

Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships

Large language models have gained significant popularity because of their
ability to generate human-like text and potential applications in various
fields, such as software engineering. Large language models for code are
commonly trained on large unsanitised corpora of source code scraped from the
internet. The content of these datasets is memorised and can be extracted by
attackers with data extraction attacks. In this work, we explore memorisation
in large language models for code and compare the rate of memorisation with
large language models trained on natural language. We adopt an existing
benchmark for natural language and construct a benchmark for code by
identifying samples that are vulnerable to attack. We run both benchmarks
against a variety of models, and perform a data extraction attack. We find that
large language models for code are vulnerable to data extraction attacks, like
their natural language counterparts. Of the training data identified as
potentially extractable, we were able to extract 47% from a
CodeGen-Mono-16B code completion model. We also observe that models memorise
more as their parameter count grows, and that their pre-training data are also
vulnerable to attack. We further find that data carriers are memorised at a higher
rate than regular code or documentation and that different model architectures
memorise different samples. Data leakage has severe outcomes, so we urge the
research community to further investigate the extent of this phenomenon using a
wider range of models and extraction techniques in order to build safeguards to
mitigate this issue.
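
The extraction measurement described above can be sketched as follows. This is an illustration only: the prefix/suffix split, the exact-match criterion, and the `generate` callable are assumptions, not the benchmark's actual interface, and the toy "model" is a lookup table rather than a real code LM.

```python
from typing import Callable, Dict, Iterable, Tuple


def extraction_rate(
    samples: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
) -> float:
    """Fraction of (prefix, suffix) samples whose suffix the model reproduces verbatim."""
    total = 0
    extracted = 0
    for prefix, true_suffix in samples:
        total += 1
        # Assumed criterion: the continuation must begin with the exact training suffix.
        if generate(prefix).startswith(true_suffix):
            extracted += 1
    if total == 0:
        raise ValueError("no samples provided")
    return extracted / total


# Hypothetical stand-in for a model: a table of "memorised" continuations.
MEMORISED: Dict[str, str] = {
    "def add(a, b):": " return a + b",
    "API_KEY =": " 'abc123'",
}


def toy_generate(prefix: str) -> str:
    return MEMORISED.get(prefix, "")


samples = [
    ("def add(a, b):", " return a + b"),
    ("API_KEY =", " 'abc123'"),
    ("import os", "\nimport sys"),
    ("x = 1", "\ny = 2"),
]
rate = extraction_rate(samples, toy_generate)
```

In a real evaluation, `toy_generate` would be replaced by a call to the model under test with a fixed decoding strategy, since the measured rate depends on how continuations are sampled.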

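Large language models have attracted considerable attention in programming, but their training data is at risk of being stolen by attackers using data extraction attacks. This study compares memorisation in large language models for code and for natural language, finds that both are vulnerable to data extraction attacks, and recommends further research and corresponding safeguards to mitigate the issue.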

The Impact of Memorisation Traces in Large Language Models on Code

Traces of Memorisation in Large Language Models for Code

Previous work has shown that Large Language Models are susceptible to
so-called data extraction attacks. Such attacks allow an attacker to extract a
sample that was contained in the training data, which has massive privacy
implications. Constructing data extraction attacks is challenging,
current attacks are quite inefficient, and there is a significant gap between
the extraction capabilities of untargeted attacks and memorization. Thus,
targeted attacks have been proposed, which identify whether a given sample from
the training data is extractable from a model. In this work, we apply a targeted
data extraction attack to the SATML2023 Language Model Training Data Extraction
Challenge. We apply a two-step approach. In the first step, we maximise the
recall of the model and are able to extract the suffix for 69% of the samples.
In the second step, we use a classifier-based Membership Inference Attack on
the generations. Our AutoSklearn classifier achieves a precision of 0.841. The
full approach reaches a score of 0.405 recall at a 10% false positive rate,
which is an improvement of 34% over the baseline of 0.301.
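
The reported metric, recall at a 10% false positive rate, can be computed roughly as in the sketch below. The function name, the thresholding rule, and the toy scores are assumptions for illustration; the challenge's official scoring may differ. The last lines check the stated relative improvement, (0.405 - 0.301) / 0.301 ≈ 0.3455, which is in line with the reported 34%.

```python
from typing import Sequence


def recall_at_fpr(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.10
) -> float:
    """Best recall over score thresholds whose false positive rate stays <= max_fpr."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    for threshold in sorted(set(scores)):
        # A sample is predicted "member" when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Hypothetical classifier confidences: 5 true members, 10 non-members.
member_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
non_member_scores = [0.75, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]
scores = member_scores + non_member_scores
labels = [1] * len(member_scores) + [0] * len(non_member_scores)
toy_recall = recall_at_fpr(scores, labels, max_fpr=0.10)

# Relative improvement of the full approach over the baseline, from the reported numbers.
relative_improvement = (0.405 - 0.301) / 0.301
```

With ten non-members, a 10% false positive rate allows at most one false positive, so the threshold settles at 0.7 in this toy data and three of the five members are recovered.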

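We apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge using a two-step approach. The first step extracts the suffix for 69% of the samples. The second step applies a classifier-based membership inference attack to the generations, and its AutoSklearn classifier achieves a precision of 0.841. The full approach reaches 0.405 recall at a 10% false positive rate, a 34% improvement over the baseline. The study shows that large language models are vulnerable to data extraction attacks and that the resulting privacy risks deserve attention.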