Generative models trained with Differential Privacy (DP) are increasingly used to produce and share synthetic data in a privacy-friendly manner. In this paper, we analyze the impact of DP on these models vis-a-vis underrepresented classes and subgroups of data. We do so from two angles: 1) the size of classes and subgroups in the synthetic data, and 2) classification accuracy on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our experiments, conducted using three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN), show that DP results in opposite size distributions in the generated synthetic data. More precisely, it affects the gap between the majority and minority classes and subgroups, either reducing it (a "Robin Hood" effect) or increasing it (a "Matthew" effect). However, both of these size shifts lead to similar disparate impacts on a classifier's accuracy, disproportionately affecting the underrepresented subparts of the data. As a result, we call for caution when analyzing or training a model on synthetic data; otherwise, one risks treating different subpopulations unevenly, which might also lead to unreliable conclusions.
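
To make the two size shifts concrete, the following is a minimal sketch, not the paper's evaluation code. It compares the majority-minority gap in the real labels with the gap in the synthetic labels. The function names `class_gap` and `size_shift` are hypothetical, and measuring the gap as a difference in class proportions is an assumption made for illustration.

```python
from collections import Counter


def class_gap(labels, majority, minority):
    """Gap between two classes, as a difference in their proportions."""
    counts = Counter(labels)  # a missing class counts as 0
    return (counts[majority] - counts[minority]) / len(labels)


def size_shift(real_labels, synth_labels):
    """Label the size shift between real and synthetic class distributions.

    Majority and minority classes are identified on the real data, so the
    same two classes are compared in both datasets.
    """
    real_counts = Counter(real_labels)
    majority = max(real_counts, key=real_counts.get)
    minority = min(real_counts, key=real_counts.get)

    real_gap = class_gap(real_labels, majority, minority)
    synth_gap = class_gap(synth_labels, majority, minority)

    if synth_gap < real_gap:
        return "Robin Hood"  # gap reduced: minority grows relative to majority
    if synth_gap > real_gap:
        return "Matthew"  # gap increased: minority shrinks further
    return "no shift"
```

For example, with a 90/10 real class split, synthetic data with a 70/30 split would be labeled "Robin Hood", while a 98/2 split would be labeled "Matthew".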

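This study analyzes the impact of Differential Privacy on class and subgroup sizes in generated synthetic data and on classification accuracy, particularly for minority subgroups and classes. An analysis using three DP models (PrivBayes, DP-WGAN, and PATE-GAN) finds that DP changes the shape of the generated synthetic data in different ways, which in turn changes classification accuracy at different levels and affects the underrepresented subparts of the data. Training models on synthetic data therefore risks treating different subgroups unequally, which can lead to unreliable or unfair results.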

Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data