Deep image clustering methods are typically evaluated on small-scale, balanced classification datasets, while feature-based $k$-means has been applied to proprietary billion-scale datasets. In this work, we explore the performance of feature-based deep clustering approaches on large-scale benchmarks while disentangling the impact of the following data-related factors: i) class imbalance, ii) class granularity, iii) easy-to-recognize classes, and iv) the ability to capture multiple classes. To this end, we develop multiple new benchmarks based on ImageNet21K. Our experimental analysis reveals that feature-based $k$-means is often unfairly evaluated on balanced datasets. However, deep clustering methods outperform $k$-means across most large-scale benchmarks. Interestingly, $k$-means underperforms on easy-to-classify benchmarks by large margins. The performance gap, however, diminishes in the largest data regimes, such as ImageNet21K. Finally, we find that non-primary cluster predictions capture meaningful classes (i.e., coarser classes).

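This work studies the performance of feature-based deep clustering methods on large-scale benchmarks and analyzes the influence of data-related factors, including class imbalance, class granularity, easy-to-recognize classes, and the ability to capture multiple classes. Evaluations on multiple new benchmarks built from ImageNet21K show that feature-based $k$-means is evaluated unfairly on balanced datasets, whereas deep clustering methods outperform $k$-means on most large-scale benchmarks. Interestingly, $k$-means performs poorly on easy-to-classify benchmarks, but the performance gap narrows at the largest data scale (e.g., ImageNet21K). Finally, non-primary cluster predictions are found to capture meaningful classes (i.e., coarser classes).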

Scaling deep clustering methods beyond ImageNet-1K