Can we efficiently extract useful information from a large user-generated dataset while protecting the privacy of the users and/or ensuring fairness in representation? We cast this problem as an instance of deletion-robust submodular maximization, where part of the data may be deleted due to privacy concerns or fairness criteria. We propose the first memory-efficient centralized, streaming, and distributed methods with constant-factor approximation guarantees against any number of adversarial deletions. We extensively evaluate the performance of our algorithms against the prior state of the art on real-world applications, including (i) Uber pick-up locations with location privacy constraints; (ii) feature selection with fairness constraints for income prediction and crime rate prediction; and (iii) deletion-robust summarization of census data consisting of 2,458,285 feature vectors.
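To make the problem setting concrete, the following is a minimal sketch of why deletions matter. It is not one of the proposed methods. It assumes a toy coverage objective (a standard monotone submodular function), a cardinality budget `k`, and the classic greedy heuristic; the names `coverage`, `greedy`, `covers`, and the data are hypothetical and chosen only for illustration.

```python
def coverage(selected, covers):
    """Toy monotone submodular objective: number of distinct items covered."""
    covered = set()
    for element in selected:
        covered |= covers[element]
    return len(covered)


def greedy(candidates, covers, k):
    """Classic greedy: repeatedly add the element with the largest marginal gain."""
    chosen = []
    remaining = set(candidates)
    for _ in range(k):
        base = coverage(chosen, covers)
        best, best_gain = None, 0
        for element in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = coverage(chosen + [element], covers) - base
            if gain > best_gain:
                best, best_gain = element, gain
        if best is None:  # no remaining element adds value
            break
        chosen.append(best)
        remaining.discard(best)
    return chosen


# Hypothetical data: each element covers a set of items.
covers = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
k = 2

# Summary chosen before any deletion request arrives.
summary = greedy(covers, covers, k)

# A deletion (e.g. a privacy request) removes element "a" afterwards.
deleted = {"a"}
surviving = [e for e in summary if e not in deleted]

# Baseline: rerun greedy from scratch on the data that was not deleted.
recomputed = greedy(set(covers) - deleted, covers, k)

print("summary:", summary, "value:", coverage(summary, covers))
print("after deletion:", surviving, "value:", coverage(surviving, covers))
print("recomputed:", recomputed, "value:", coverage(recomputed, covers))
```

In this toy instance, a single deletion reduces the value of the original summary well below what a full recomputation achieves. A deletion-robust method aims to retain enough information up front that a good solution can be recovered after deletions without reprocessing the whole dataset.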


Deletion-Robust Submodular Maximization at Scale