We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range.
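The claim that gradients explode with depth regardless of the initial weight variance can be checked numerically. The following is a minimal NumPy sketch, not the paper's own code. It assumes a ReLU network of constant width, batch normalization without learned scale or shift applied to the pre-activations, Gaussian inputs, and a random linear readout as the loss; the function name `bn_relu_gradient_norms` and all default sizes are hypothetical choices made for illustration.

```python
import numpy as np


def bn_relu_gradient_norms(depth=30, width=256, batch=64,
                           sigma_w=1.0, seed=0, eps=1e-5):
    """Squared gradient norm w.r.t. the pre-activations of every layer.

    Illustrative setup (an assumption, not the paper's exact experiment):
    x -> W x -> batch norm (no learned scale/shift) -> ReLU, repeated
    `depth` times, with loss L = sum(x_out * R) for a random matrix R.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))

    # Forward pass, caching what the backward pass needs.
    cache = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        h = x @ W.T
        mu = h.mean(axis=0)                  # per-feature batch mean
        std = np.sqrt(h.var(axis=0) + eps)   # per-feature batch std
        h_hat = (h - mu) / std
        x = np.maximum(h_hat, 0.0)
        cache.append((W, h_hat, std))

    # Gradient of the random linear readout w.r.t. the final activations.
    g_x = rng.standard_normal(x.shape)

    norms = np.empty(depth)
    for layer in reversed(range(depth)):
        W, h_hat, std = cache[layer]
        g_hat = g_x * (h_hat > 0)            # backprop through ReLU
        # Backprop through batch norm (per feature, over the batch axis).
        g_h = (g_hat
               - g_hat.mean(axis=0)
               - h_hat * (g_hat * h_hat).mean(axis=0)) / std
        norms[layer] = np.sum(g_h ** 2)
        g_x = g_h @ W                        # backprop through the linear map
    return norms


if __name__ == "__main__":
    for sigma_w in (0.5, 1.0, 2.0):
        norms = bn_relu_gradient_norms(sigma_w=sigma_w)
        print(f"sigma_w={sigma_w}: first/last layer ratio = "
              f"{norms[0] / norms[-1]:.3e}")
```

In this sketch, the ratio between the squared gradient norm at the first and last layers grows rapidly with `depth`. It is essentially unchanged when `sigma_w` is varied, because batch normalization rescales the pre-activations and so cancels the weight scale in both the forward and backward pass.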

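We study batch normalization in fully-connected feedforward neural networks and develop a mean field theory for it. We show that batch normalization causes gradient explosion, and that this explosion cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. It can, however, be reduced by tuning the network close to the linear regime, which improves the trainability of the network. We also study the learning dynamics of batch-normalized networks.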

A Mean Field Theory of Batch Normalization