Clustering unlabeled raw images is a daunting task that deep learning methods have recently approached with some success. Here we propose an unsupervised clustering framework that trains a deep neural network end to end and provides direct cluster assignments of images without additional processing. The framework, Multi-Modal Deep Clustering (MMDC), trains a deep network to align its image embeddings with target points sampled from a Gaussian Mixture Model distribution. Cluster assignments are then determined by the mixture component with which each image embedding is associated. Simultaneously, the same deep network is trained to solve an additional self-supervised task, which pushes it to learn more meaningful image representations and stabilizes training. Experimental results show that MMDC matches or exceeds state-of-the-art performance on six challenging benchmarks. On natural image datasets we improve on previous results by significant margins of up to 11 absolute percentage points in accuracy, yielding an accuracy of 70% on CIFAR-10, 31% on CIFAR-100 and 61% on STL-10.
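To make the target-sampling and assignment steps concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the mixture parameters, so several choices here are assumptions for illustration: component means placed at scaled one-hot vectors, equal mixture weights, a shared isotropic covariance, and a mean squared distance as the alignment objective. All function names are hypothetical. The network itself is omitted, and the sampled targets stand in for its embeddings.

```python
import random


def make_component_means(num_clusters, dim, scale=1.0):
    """Assumed layout: component k is centred at scale * e_k (a one-hot axis)."""
    if dim < num_clusters:
        raise ValueError("dim must be >= num_clusters for one-hot means")
    return [[scale if j == k else 0.0 for j in range(dim)]
            for k in range(num_clusters)]


def sample_gmm_targets(means, num_samples, sigma, rng):
    """Draw targets from an equal-weight GMM with covariance sigma^2 * I."""
    targets, components = [], []
    for _ in range(num_samples):
        k = rng.randrange(len(means))  # pick a mixture component uniformly
        targets.append([rng.gauss(m, sigma) for m in means[k]])
        components.append(k)
    return targets, components


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_clusters(embeddings, means):
    """Under the assumptions above (equal weights, shared isotropic
    covariance), the most probable component is the nearest mean."""
    return [min(range(len(means)), key=lambda k: squared_distance(z, means[k]))
            for z in embeddings]


def alignment_loss(embeddings, targets):
    """Assumed objective: mean squared distance between paired points."""
    return sum(squared_distance(z, t)
               for z, t in zip(embeddings, targets)) / len(embeddings)


# Usage: the sampled targets stand in for embeddings of a trained network.
rng = random.Random(0)
means = make_component_means(num_clusters=3, dim=4, scale=5.0)
targets, components = sample_gmm_targets(means, num_samples=6, sigma=0.1, rng=rng)
labels = assign_clusters(targets, means)
```

In this sketch a network would be trained to reduce `alignment_loss` between its embeddings and their paired targets. How embeddings are paired with targets during training is not described in the abstract and is left out here.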

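We propose an unsupervised clustering framework that trains a deep neural network end to end to assign images to clusters directly, while a self-supervised task yields more meaningful image feature representations. Experimental results show that the method achieves strong results on six challenging benchmarks.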

Multi-Modal Deep Clustering: Unsupervised Partitioning of Images