A generally accepted view in modern deep learning is that increasing the number of parameters usually improves quality. The two easiest ways to increase the number of parameters are to enlarge the network, e.g. its width, or to train a deep ensemble; both approaches improve performance in practice. In this work, we consider a fixed memory budget setting and investigate which is more effective: training a single wide network, or performing a memory split, i.e. training an ensemble of several thinner networks with the same total number of parameters. We find that, for large enough budgets, the number of networks in the ensemble corresponding to the optimal memory split is usually larger than one. Interestingly, this effect holds for commonly used sizes of standard architectures. For example, one WideResNet-28-10 achieves significantly worse test accuracy on CIFAR-100 than an ensemble of sixteen thinner WideResNets: 80.6% and 82.52%, respectively. We call this effect the Memory Split Advantage and show that it holds for a variety of datasets and model architectures.
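
To make the notion of a memory split concrete, the following minimal sketch enumerates ways of dividing one parameter budget among ensemble members. It is illustrative only: the fully connected network, its layer sizes, and the names `mlp_param_count`, `widest_member`, and `memory_splits` are assumptions introduced here and are not the WideResNet setup described above. The sketch only counts parameters; deciding which split is actually best still requires training and evaluating each configuration.

```python
def mlp_param_count(width: int, depth: int = 3, in_dim: int = 3072, out_dim: int = 100) -> int:
    """Weights plus biases of a fully connected net with `depth` hidden layers of equal width.

    The architecture and default sizes are hypothetical stand-ins chosen for illustration.
    """
    dims = [in_dim] + [width] * depth + [out_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))


def widest_member(budget: int, n_members: int, **arch) -> int:
    """Largest hidden width such that `n_members` identical networks fit into `budget`."""
    per_member = budget // n_members
    width = 0
    # The parameter count grows monotonically with width, so a linear scan suffices.
    while mlp_param_count(width + 1, **arch) <= per_member:
        width += 1
    return width


def memory_splits(budget: int, member_counts=(1, 2, 4, 8, 16), **arch) -> dict:
    """Map each ensemble size to (width per member, total parameters actually used)."""
    splits = {}
    for k in member_counts:
        width = widest_member(budget, k, **arch)
        splits[k] = (width, k * mlp_param_count(width, **arch))
    return splits


if __name__ == "__main__":
    # Budget: the size of one wide network with hidden width 1024.
    budget = mlp_param_count(1024)
    for k, (width, total) in memory_splits(budget).items():
        print(f"{k:2d} network(s) of width {width:4d}: {total} of {budget} parameters")
```

Each row printed by the script is one candidate memory split with (almost) the same total parameter count; the comparison studied here is between the test accuracy of these candidates after training.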

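This study asks which is more effective under a fixed memory budget: training a single wide network or training an ensemble of thinner networks. It finds that, for large enough budgets, a memory split, i.e. training an ensemble of thinner networks, is usually more effective than training a single wide network. This finding is called the "Memory Split Advantage" and holds for a variety of datasets and model architectures.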

Deep ensembles on a fixed memory budget: one wide network or several thinner ones?