TL;DR: This work proposes a simple Mixture of Experts model that performs well on large-scale, multi-label prediction tasks. It is suited to settings where the data distribution is imbalanced and the model is too large to fit on a single GPU, and it supports parallel training and a unified feature embedding space. The model's performance suggests this approach can be used to train larger deep learning models with greater capacity.
Abstract
Training convolutional networks (CNNs) that fit on a single GPU with minibatch stochastic gradient descent has become effective in practice. However, there is still no effective method for training large CNNs that do not fit in the memory of a single GPU.
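The summary above does not specify the architecture in detail, but the core Mixture of Experts idea can be sketched minimally: a gating network produces a per-example weighting over several expert networks, and the model output is the gate-weighted combination of the expert outputs. The sketch below uses plain linear experts and illustrative dimensions; all names and sizes are assumptions for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # numerically stable softmax along the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MixtureOfExperts:
    """Minimal dense Mixture of Experts: a gating network assigns each
    input a weighting over experts (here, single linear maps), and the
    output is the gate-weighted sum of expert outputs."""

    def __init__(self, d_in, d_out, n_experts):
        self.gate = rng.normal(size=(d_in, n_experts)) * 0.1
        self.experts = rng.normal(size=(n_experts, d_in, d_out)) * 0.1

    def forward(self, x):
        # gate_weights: (batch, n_experts), rows sum to 1
        gate_weights = softmax(x @ self.gate)
        # expert_outs: (n_experts, batch, d_out), one output per expert
        expert_outs = np.einsum('bi,eio->ebo', x, self.experts)
        # combine expert outputs weighted by the gate
        return np.einsum('be,ebo->bo', gate_weights, expert_outs)

moe = MixtureOfExperts(d_in=8, d_out=4, n_experts=3)
x = rng.normal(size=(5, 8))
y = moe.forward(x)
print(y.shape)  # (5, 4)
```

Because each expert's parameters are used independently, experts can in principle be placed on separate devices and trained in parallel, which is the property the summary highlights for models too large for one GPU.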