Inducing and leveraging sparse activations during training and inference is a promising avenue for improving the computational efficiency of deep networks, which is increasingly important as network sizes continue to grow and their application becomes more widespread. Here we use the large width Gaussian process limit to analyze the behaviour, at random initialization, of nonlinear activations that induce sparsity in the hidden outputs. A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification; those being a shifted ReLU ($\phi(x)=\max(0, x-\tau)$ for $\tau\ge 0$) and soft thresholding ($\phi(x)=0$ for $|x|\le\tau$ and $x-\text{sign}(x)\tau$ for $|x|>\tau$). We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map. Numerical experiments verify the theory and show that the proposed magnitude clipped sparsifying activations can be trained with training and test fractional sparsity as high as 85\% while retaining close to full accuracy.

通过剪枝层来诱导和利用稀疏激活是提高深度网络计算效率的一种有前途的方法，本论文使用大尺度高斯过程极限分析了随机初始化时诱导隐藏层稀疏性的非线性激活函数，证明了一种先前未报告的培训不稳定性，并表明通过剪枝激活函数的幅度，可以克服这种不稳定性，理论验证和数值实验表明，这种剪枝激活函数能够在训练和测试时保持接近完全准确度的同时达到高达85％的稀疏度。

稀疏诱导激活的深度神经网络初始化