The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.
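As a rough illustration of the contrast drawn above, the following sketch pushes a unit vector through a product of square weight matrices and compares its output norm under the two initializations. It only illustrates the isometry property at initialization, not the convergence result itself. The helper names (`gaussian_matrix`, `orthogonal_matrix`, `output_norm`), the square layers, the N(0, 1/n) scaling, and the depth and width values are all assumptions chosen for illustration, not the paper's setup.

```python
import math
import random


def gaussian_matrix(n, rng):
    """n x n matrix with iid N(0, 1/n) entries (assumed variance-preserving scaling)."""
    std = 1.0 / math.sqrt(n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]


def orthogonal_matrix(n, rng):
    """n x n orthogonal matrix built by Gram-Schmidt on random Gaussian rows."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Two projection passes against the rows found so far, for numerical stability.
        for _ in range(2):
            for q in rows:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def output_norm(depth, width, init, rng):
    """Norm of W_depth ... W_1 x for a random unit vector x."""
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm = math.sqrt(sum(a * a for a in x))
    x = [a / norm for a in x]
    for _ in range(depth):
        x = matvec(init(width, rng), x)
    return math.sqrt(sum(a * a for a in x))


rng = random.Random(0)
orth_norms = [output_norm(20, 16, orthogonal_matrix, rng) for _ in range(5)]
gauss_norms = [output_norm(20, 16, gaussian_matrix, rng) for _ in range(5)]
print("orthogonal:", [round(v, 6) for v in orth_norms])
print("gaussian:  ", [round(v, 6) for v in gauss_norms])
```

In this toy setting the orthogonal product preserves the input norm up to floating-point error at any depth, whereas the norm under the Gaussian product fluctuates from draw to draw, and the fluctuation grows with depth at fixed width.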

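This paper studies the choice of initial parameter values, one of the most impactful hyperparameter choices in gradient-based optimization of deep neural networks. It analyzes the concrete effects of different initialization schemes in deep linear networks, proves that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights, and suggests an explanation for the empirical success of initializing very deep non-linear networks according to the principle of dynamical isometry.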

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks