Coresets are efficient representations of datasets such that models trained on a coreset are provably competitive with models trained on the original dataset. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive datasets. However, until now, the algorithms and corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the squared Euclidean distance, the Mahalanobis distance, the KL divergence (relative entropy) and the Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and exponential family distributions. Our results recover existing coreset constructions for K-Means and Gaussian mixture models and imply polynomial-time approximation schemes for various hard clustering problems.
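
To make the class of distortion measures concrete, the following is a minimal sketch of the standard definition of a Bregman divergence, d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, instantiated for two of the measures named above. It illustrates the divergences only, not the coreset construction itself. The function names (`bregman_divergence`, `sq_norm`, `neg_entropy`, and their gradients) are illustrative choices rather than names from the paper.

```python
import math

def bregman_divergence(phi, grad_phi, p, q):
    """Bregman divergence d_phi(p, q) = phi(p) - phi(q) - <grad_phi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator phi(x) = ||x||^2 yields the squared Euclidean distance.
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2.0 * xi for xi in x]

# Generator phi(x) = sum_i x_i log x_i (negative entropy) yields the
# KL divergence when p and q are probability vectors with positive entries.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

# Example points (hypothetical data, chosen only for illustration).
p = [0.2, 0.8]
q = [0.5, 0.5]

d_euclid = bregman_divergence(sq_norm, sq_norm_grad, p, q)
d_kl = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)
```

Here `d_euclid` equals the squared Euclidean distance between `p` and `q`, and `d_kl` equals the sum of `p_i * log(p_i / q_i)`. Note that a Bregman divergence is in general not symmetric in its two arguments.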

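We propose a single, practical algorithm that uses Bregman divergences to construct strong coresets for a broad class of hard and soft clustering problems, and we demonstrate its practicality.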

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixture Models