We provide a theoretical explanation for the superb performance of ResNet via the study of deep linear networks and some nonlinear variants. We show that, with or without nonlinearities, adding shortcuts of depth two makes the condition number of the Hessian of the loss function at the zero initial point depth-invariant, so that training very deep models is no more difficult than training shallow ones. Shortcuts of higher depth result in an extremely flat (high-order) stationary point at initialization, from which it is hard for the optimization algorithm to escape. The 1-shortcut, however, is essentially equivalent to no shortcuts. Extensive experiments accompany our theoretical results. We show that initializing the network to small weights with 2-shortcuts achieves significantly better results than random Gaussian (Xavier) initialization, orthogonal initialization, and shortcuts of greater depth, from perspectives ranging from final loss, learning dynamics, and stability to the behavior of the Hessian along the learning process.

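ResNet is a residual network that uses shortcut connections to significantly reduce the difficulty of training while achieving substantial improvements in both training and generalization error. We provide a theoretical explanation of what is unique about 2-shortcuts: they make training very deep models as easy as training shallow ones. Our experiments also show that initializing to small weights with 2-shortcuts achieves significantly better results from several perspectives (final loss, learning dynamics and stability, and the behavior of the Hessian along the learning process).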
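The depth-invariance claim can be checked numerically on a toy deep linear network. The sketch below is not the authors' code; it assumes whitened inputs (so the loss reduces to a Frobenius-norm distance between the network's end-to-end matrix and a target map), and the names `loss`, `hessian_at_zero`, `hessian_condition_number`, and the target matrix `A` are hypothetical choices made for illustration. The Hessian at the zero point is estimated with central finite differences, and the condition number is taken as the ratio of the largest to the smallest absolute eigenvalue.

```python
import numpy as np


def loss(theta, depth, k, n, target):
    """0.5 * ||prod_i (I + W_{i,k} ... W_{i,1}) - target||_F^2.

    Assumes whitened inputs, so only the end-to-end linear map matters.
    `k` is the shortcut depth (number of weight matrices skipped).
    """
    weights = theta.reshape(depth, k, n, n)
    net = np.eye(n)
    for block in weights:
        residual = np.eye(n)
        for w in block:  # W_{i,1} is applied first, W_{i,k} last
            residual = w @ residual
        net = (np.eye(n) + residual) @ net
    return 0.5 * np.sum((net - target) ** 2)


def hessian_at_zero(depth, k, n, target, h=1e-3):
    """Central finite-difference Hessian of `loss` at theta = 0."""
    p = depth * k * n * n
    steps = np.eye(p) * h
    hess = np.zeros((p, p))
    for i in range(p):
        for j in range(i, p):
            f_pp = loss(steps[i] + steps[j], depth, k, n, target)
            f_pm = loss(steps[i] - steps[j], depth, k, n, target)
            f_mp = loss(-steps[i] + steps[j], depth, k, n, target)
            f_mm = loss(-steps[i] - steps[j], depth, k, n, target)
            hess[i, j] = hess[j, i] = (f_pp - f_pm - f_mp + f_mm) / (4 * h * h)
    return hess


def hessian_condition_number(depth, k, n, target):
    """Ratio of largest to smallest absolute eigenvalue of the Hessian at zero."""
    eigvals = np.abs(np.linalg.eigvalsh(hessian_at_zero(depth, k, n, target)))
    return eigvals.max() / eigvals.min()


# Hypothetical 2x2 target map, chosen so that A - I is invertible.
A = np.array([[2.0, 0.5], [0.0, 3.0]])

for depth in (1, 2, 3):
    print("2-shortcut, depth", depth, "cond:", hessian_condition_number(depth, 2, 2, A))

# With 3-shortcuts the Hessian at zero vanishes (a flat, higher-order stationary point).
print("3-shortcut max |H| entry:", np.abs(hessian_at_zero(2, 3, 2, A)).max())
```

In this toy setting the printed condition number for 2-shortcuts stays the same as the depth grows and coincides with the condition number of `A - I`, while the 3-shortcut Hessian is identically zero at the origin. This matches the qualitative picture above, but only under the stated assumptions (linear layers, whitened inputs, a small hand-picked target).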

Demystifying ResNet