This paper introduces a novel K-means clustering algorithm that extends the conventional Big-means methodology. The proposed method combines parallel processing, stochastic sampling, and competitive optimization into a scalable variant designed for big data applications, addressing the scalability and computation-time limitations of traditional techniques. During execution, the algorithm dynamically adjusts the sample size used by each worker, and the results obtained with the different sample sizes are continually analyzed to identify the most efficient configuration. Competition among workers that use different sample sizes further improves the efficiency of the Big-means algorithm. In essence, the algorithm balances computation time against clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.

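This research paper presents an innovative K-means clustering algorithm that integrates parallel processing, random sampling, and competitive optimization to obtain a scalable variant suited to big data applications. The algorithm optimizes performance by dynamically adjusting the sample size of each worker, and it further improves the efficiency of the Big-means algorithm by introducing competition among workers that use different sample sizes. Its stochastic, competitive sampling strategy in a parallel computing environment also lets the algorithm balance computation time against clustering quality.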
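
The strategy described above can be illustrated with a small Python sketch. It is not the authors' implementation, only a toy under stated assumptions: the names `competitive_big_means`, `worker_round`, `lloyd`, and `objective` are hypothetical, each worker draws its sample size uniformly from a fixed candidate list, and candidates are scored on the full data set, which a real big-data setting would likely replace with a sample-based score. A thread pool stands in for the parallel workers; genuine CPU parallelism in Python would need processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def objective(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


def lloyd(sample, centroids, iters=10):
    """Run a few K-means (Lloyd) iterations on a sample, warm-started from `centroids`."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in sample:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Recompute each centroid as its cluster mean; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids


def worker_round(data, incumbent, size, seed):
    """One worker: sample `size` points, refine the incumbent, report the score."""
    rng = random.Random(seed)
    sample = rng.sample(data, size)
    candidate = lloyd(sample, incumbent)
    # Assumption: score on the full data so that scores are comparable across sample sizes.
    return objective(data, candidate), candidate, size


def competitive_big_means(data, k, sizes, rounds=5, n_workers=4, seed=0):
    """Workers with randomly chosen sample sizes compete to improve a shared incumbent."""
    rng = random.Random(seed)
    best = rng.sample(data, k)
    best_score = objective(data, best)
    wins = {s: 0 for s in sizes}  # how often each sample size improved the incumbent
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(rounds):
            jobs = [(best, rng.choice(sizes), rng.randrange(2**32)) for _ in range(n_workers)]
            results = pool.map(lambda job: worker_round(data, *job), jobs)
            for score, candidate, size in results:
                if score < best_score:  # the incumbent is replaced only by a strictly better candidate
                    best, best_score = candidate, score
                    wins[size] += 1
    return best, best_score, wins


if __name__ == "__main__":
    gen = random.Random(1)
    blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    data = [(gen.gauss(cx, 0.5), gen.gauss(cy, 0.5)) for cx, cy in blobs for _ in range(100)]
    centroids, score, wins = competitive_big_means(data, k=3, sizes=[20, 50, 100])
    print("objective:", score)
    print("improvements per sample size:", wins)
```

The `wins` counter is one simple way to observe which sample sizes tend to produce improvements. A fuller version might feed that signal back into how sample sizes are chosen, which this sketch does not attempt.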

Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means