Many large-scale machine learning problems (clustering, non-parametric learning, kernel machines, etc.) require selecting a small yet representative subset from a large dataset. Such problems can often be reduced to maximizing a submodular set function subject to various constraints. Classical approaches to submodular optimization require centralized access to the full dataset, which is impractical for truly large-scale problems. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol, GreeDi, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to the centralized approach can be achieved. We begin with monotone submodular maximization subject to a cardinality constraint, and then extend this approach to obtain approximation guarantees for (not necessarily monotone) submodular maximization subject to more general constraints, including matroid or knapsack constraints. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering on tens of millions of examples using Hadoop.
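To make the two-stage idea concrete, the following is a minimal single-process sketch for the monotone, cardinality-constrained case. The abstract does not spell out the stages, so the details here are assumptions: each of `m` simulated machines runs the standard greedy algorithm on its own partition, the local solutions are merged, greedy is run once more on the merged set, and the best of all candidate solutions is returned. The names `greedy`, `greedi`, `coverage`, and the toy `COVERS` data are hypothetical and chosen only for illustration; a real deployment would run the first stage as a map step and the second as a reduce step.

```python
from typing import Callable, Dict, Iterable, List, Set

# Toy instance (hypothetical): element -> set of items it covers.
COVERS: Dict[int, Set[int]] = {
    0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7},
    3: {1}, 4: {7, 8}, 5: {2, 8},
}

def coverage(subset: Iterable[int]) -> int:
    """Monotone submodular objective: number of distinct items covered."""
    return len(set().union(*(COVERS[e] for e in subset)))

def greedy(candidates: Iterable[int], f: Callable[[Set[int]], int], k: int) -> Set[int]:
    """Standard greedy: repeatedly add the element with the largest marginal gain."""
    selected: Set[int] = set()
    remaining = set(candidates)
    for _ in range(min(k, len(remaining))):
        base = f(selected)
        # sorted() makes tie-breaking deterministic (smallest element wins).
        best = max(sorted(remaining), key=lambda e: f(selected | {e}) - base)
        if f(selected | {best}) - base <= 0:
            break  # no remaining element improves the objective
        selected.add(best)
        remaining.remove(best)
    return selected

def greedi(ground: Iterable[int], f: Callable[[Set[int]], int], k: int, m: int) -> Set[int]:
    """Two-stage sketch: local greedy on m partitions, then greedy on the merged result."""
    items = sorted(ground)
    partitions: List[List[int]] = [items[i::m] for i in range(m)]  # round-robin split
    # Stage 1 (assumed "map" step): each machine selects up to k elements locally.
    local = [greedy(part, f, k) for part in partitions]
    # Stage 2 (assumed "reduce" step): greedy over the union of local solutions.
    merged = set().union(*local)
    second = greedy(merged, f, k)
    # Return the best candidate among the local solutions and the merged one.
    return max(local + [second], key=f)

result = greedi(COVERS.keys(), coverage, k=2, m=2)
central = greedy(COVERS.keys(), coverage, k=2)
```

On this toy instance the distributed sketch and the centralized greedy baseline select sets of equal value; this illustrates the protocol only and says nothing about the guarantees proved in the paper.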

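This paper proposes GreeDi, a submodular function maximization method suited to distributed computation that can be implemented in the MapReduce framework. Preliminary experiments show that the method can be applied to submodular optimization problems arising in large-scale machine learning tasks, such as sparse Gaussian process inference and exemplar-based clustering, and that, under certain natural conditions, it can achieve performance close to that of the traditional centralized approach.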

Distributed Submodular Maximization