Practitioners commonly download pretrained machine learning models from open
repositories and finetune them to fit specific applications. We show that this
practice introduces a new risk of privacy backdoors. By tampering with a
pretrained model's weights, an attacker can fully compromise the privacy of the
finetuning data. We show how to build privacy backdoors for a variety of
models, including transformers, which enable an attacker to reconstruct
individual finetuning samples, with a guaranteed success! We further show that
backdoored models allow for tight privacy attacks on models trained with
differential privacy (DP). The common optimistic practice of training DP models
with loose privacy guarantees is thus insecure if the model is not trusted.
Overall, our work highlights a crucial and overlooked supply chain attack on
machine learning privacy.

预训练机器学习模型存在隐私后门的风险，攻击者能够通过篡改权重完全破坏微调数据的隐私。我们展示了如何为各种模型（包括 transformers）构建隐私后门，进而成功重构个体微调样本。此外，我们还展示了被注入后门的模型能够对使用差分隐私训练的模型进行隐私攻击。因此，如果模型不受信任，使用宽松隐私保证进行差分隐私模型训练的常见乐观实践是不安全的。总的来说，我们的工作突出了对机器学习隐私的一种关键而被忽视的供应链攻击。

隐私后门：通过已被污染的预训练模型窃取数据

Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

Pretrained machine learning models are known to perpetuate and even amplify
existing biases in data, which can result in unfair outcomes that ultimately
impact user experience. Therefore, it is crucial to understand the mechanisms
behind those prejudicial biases to ensure that model performance does not
result in discriminatory behaviour toward certain groups or populations. In
this work, we define gender bias as our case study. We quantify bias
amplification in pretraining and after fine-tuning on three families of
vision-and-language models. We investigate the connection, if any, between the
two learning stages, and evaluate how bias amplification reflects on model
performance. Overall, we find that bias amplification in pretraining and after
fine-tuning are independent. We then examine the effect of continued
pretraining on gender-neutral data, finding that this reduces group
disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without
significantly compromising task performance.

在这项研究中，我们以性别偏见为案例研究，通过量化预训练和微调对三类视觉与语言模型中的偏见放大进行分析，研究了这两个学习阶段之间的联系，并评估了偏见放大对模型性能的影响。总体来说，我们发现预训练和微调中的偏见放大是相互独立的。接着，我们研究了对性别中性数据的持续预训练对 VQAv2 和检索任务的影响，发现这种方法可以减少群体间的差异并提升公平性，而不会显著影响任务性能。