Batch normalization has multiple benefits. It improves the conditioning of the loss landscape, and is a surprisingly effective regularizer. However, the most important benefit of batch normalization arises in residual networks, where it dramatically increases the largest trainable depth. We identify the origin of this benefit: At initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor proportional to the square root of the network depth. This ensures that, early in training, the function computed by deep normalized residual networks is dominated by shallow paths with well-behaved gradients. We use this insight to develop a simple initialization scheme which can train very deep residual networks without normalization. We also clarify that, although batch normalization does enable stable training with larger learning rates, this benefit is only useful when one wishes to parallelize training over large batch sizes. Our results help isolate the distinct benefits of batch normalization in different architectures.
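
As a rough numerical illustration of the downscaling argument above, the following sketch measures the size of the residual branch relative to the skip connection at each block of a randomly initialized network. It is a toy setting and not the experimental setup of the paper: the residual branches are single linear layers with variance-preserving Gaussian weights, there are no nonlinearities, and the helper names (`batch_norm`, `branch_to_skip_ratio`) are hypothetical. Under these assumptions, the ratio should stay near 1 without normalization, and should shrink roughly like one over the square root of the block index when the branch input is normalized.

```python
import math
import random


def batch_norm(x):
    """Normalize each feature (column) to zero mean and unit variance over the batch."""
    n_batch, width = len(x), len(x[0])
    out = [[0.0] * width for _ in range(n_batch)]
    for j in range(width):
        col = [row[j] for row in x]
        mean = sum(col) / n_batch
        var = sum((v - mean) ** 2 for v in col) / n_batch
        std = math.sqrt(var + 1e-5)
        for i in range(n_batch):
            out[i][j] = (col[i] - mean) / std
    return out


def linear(x, w):
    """Apply a weight matrix (one weight vector per output unit) to every row of the batch."""
    return [[sum(a * b for a, b in zip(row, w_out)) for w_out in w] for row in x]


def mean_square(x):
    """Average squared activation over the whole batch."""
    return sum(v * v for row in x for v in row) / (len(x) * len(x[0]))


def branch_to_skip_ratio(depth, normalize, width=32, n_batch=64, seed=0):
    """Return RMS(residual branch) / RMS(skip path) for each block at initialization."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_batch)]
    ratios = []
    for _ in range(depth):
        # Assumed toy branch: one linear layer with variance-preserving weights.
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
             for _ in range(width)]
        branch_in = batch_norm(x) if normalize else x
        branch = linear(branch_in, w)
        ratios.append(math.sqrt(mean_square(branch) / mean_square(x)))
        # Residual update: skip path plus branch output.
        x = [[s + b for s, b in zip(row_s, row_b)] for row_s, row_b in zip(x, branch)]
    return ratios


if __name__ == "__main__":
    plain = branch_to_skip_ratio(depth=16, normalize=False)
    normed = branch_to_skip_ratio(depth=16, normalize=True)
    for block, (p, n) in enumerate(zip(plain, normed), start=1):
        print(f"block {block:2d}  unnormalized ratio {p:.2f}  normalized ratio {n:.2f}")
```

In this sketch the normalized branch always has roughly unit variance, while the variance on the skip path accumulates by about one unit per block, which is what makes deeper blocks contribute relatively less at initialization.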

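At initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by the normalized residual blocks in deep networks is close to the identity function. This is one of the key reasons why batch normalization dramatically increases the largest trainable depth of residual networks, and it has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We also propose a simple initialization scheme that can train deep residual networks without normalization, and we conduct a detailed empirical study of residual networks. This study clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes and has almost no benefit when the batch size is small.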
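
The summary above does not spell out the initialization scheme. One simple way to reproduce the same downscaling without normalization, assumed here purely for illustration, is to multiply each residual branch by a scalar `alpha` that starts at zero or at about one over the square root of the depth. The sketch below uses linear branches with Gaussian weights and hypothetical helper names (`init_weights`, `residual_stack`). It shows that `alpha = 0` makes the whole stack exactly the identity at initialization, while `alpha = 1` lets the activations grow rapidly with depth.

```python
import math
import random


def init_weights(depth, width, rng):
    """Draw one width-by-width Gaussian matrix per block, scaled to preserve variance."""
    std = 1.0 / math.sqrt(width)
    return [[[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
            for _ in range(depth)]


def residual_stack(x, weights, alpha):
    """Apply unnormalized residual blocks x <- x + alpha * (W x), one per weight matrix."""
    for w in weights:
        branch = [sum(wij * xj for wij, xj in zip(w_row, x)) for w_row in w]
        x = [xi + alpha * bi for xi, bi in zip(x, branch)]
    return x


def mean_square(x):
    """Average squared entry of a vector."""
    return sum(v * v for v in x) / len(x)


if __name__ == "__main__":
    rng = random.Random(0)
    depth, width = 16, 32
    x0 = [rng.gauss(0.0, 1.0) for _ in range(width)]
    weights = init_weights(depth, width, rng)
    # alpha is the assumed scalar multiplier on every residual branch.
    for alpha in (0.0, 1.0 / math.sqrt(depth), 1.0):
        out = residual_stack(x0, weights, alpha)
        growth = mean_square(out) / mean_square(x0)
        print(f"alpha={alpha:.3f}  output/input mean square = {growth:.3g}")
```

In a trainable version of this sketch, `alpha` would be a learnable parameter, so the network starts near the identity and the residual branches can grow in importance during training.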

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks