Several challenges make it difficult for sparse neural networks to compete
with dense models. First, setting a large fraction of weights to zero impairs
forward and gradient signal propagation. Second, sparsity studies often need to
test multiple sparsity levels while also introducing new hyperparameters
(HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to
reuse the learning HPs originally crafted for dense models. Unfortunately, we
show that sparse and dense networks do not share the same optimal HPs. Without
stable dynamics and effective training recipes, it is costly to test sparsity
at scale, which is key to surpassing dense networks and making the business
case for sparsity acceleration in hardware. A holistic approach is needed to
tackle these challenges, and we propose S$\mu$Par as one such approach.
S$\mu$Par ensures activations, gradients, and weight updates all scale
independently of sparsity level. Further, by reparameterizing the HPs,
S$\mu$Par enables the same HP values to be optimal as we vary both sparsity
level and model width. HPs can be tuned on small dense networks and transferred
to large sparse models, greatly reducing tuning costs. On large-scale language
modeling, S$\mu$Par training improves loss by up to 8.2% over the common
approach of using the standard parameterization of dense models.
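
As a rough illustration of why sparsity disturbs signal propagation, the sketch below measures the pre-activation scale of a single randomly masked linear layer. It is not the S$\mu$Par rule itself, which is not spelled out here. It only assumes a random static mask and Gaussian weights, and the helper name `preactivation_std` is hypothetical. With a dense-style initialization (standard deviation `1/sqrt(fan_in)`), the output scale shrinks as density falls; dividing the initialization variance by the density as well is one simple way to keep the scale roughly independent of sparsity in the forward pass.

```python
import math
import random


def preactivation_std(fan_in, density, init_std, n_out=400, seed=0):
    """Empirical std of y = (M * W) x for one randomly masked linear layer.

    Illustrative helper (hypothetical, not taken from the paper):
    each weight is kept with probability `density` and drawn from
    N(0, init_std^2); the input x has unit-variance entries.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
    ys = []
    for _ in range(n_out):
        y = 0.0
        for xj in x:
            if rng.random() < density:  # weight survives the mask
                y += rng.gauss(0.0, init_std) * xj
        ys.append(y)
    mean = sum(ys) / n_out
    return math.sqrt(sum((y - mean) ** 2 for y in ys) / n_out)


if __name__ == "__main__":
    fan_in = 512
    for density in (1.0, 0.25, 0.0625):
        dense_style = 1.0 / math.sqrt(fan_in)             # ignores sparsity
        sparsity_aware = 1.0 / math.sqrt(fan_in * density)  # assumed correction
        print(
            f"density={density:<7}"
            f"dense-style init: {preactivation_std(fan_in, density, dense_style):.3f}  "
            f"density-aware init: {preactivation_std(fan_in, density, sparsity_aware):.3f}"
        )
```

Under these assumptions the expected pre-activation variance is `fan_in * density * init_std**2`, so the dense-style column should shrink roughly like `sqrt(density)` while the density-aware column should stay near 1. The sketch covers only the forward pass at initialization; gradients, weight updates, and learning-rate scaling would need analogous treatment.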

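By reparameterizing the hyperparameters, SμPar keeps the same hyperparameter values optimal across sparsity levels and model widths, addressing the challenges of sparse neural networks and improving loss by up to 8.2% on large-scale language modeling.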