A recent line of research has provided convergence guarantees for gradient descent algorithms in the excessive over-parameterization regime where the widths of all the hidden layers are required to be polynomially large in the number of training samples. However, the widths of practical deep networks are often only large in the first layer(s) and then start to decrease towards the output layer. This raises an interesting open question whether similar results also hold under this empirically relevant setting. Existing theoretical insights suggest that the loss surface of this class of networks is well-behaved, but these results usually do not provide direct algorithmic guarantees for optimization. In this paper, we close the gap by showing that one wide layer followed by pyramidal deep network topology suffices for gradient descent to find a global minimum with a geometric rate. Our proof is based on a weak form of Polyak-Lojasiewicz inequality which holds for deep pyramidal networks in the manifold of full-rank weight matrices.

针对深度神经网络的全局最小化问题，证明对于采用金字塔形拓扑结构，且只有第一层宽度为N的深度神经网络，可以找到和宽度多项式增长时相似的最小值。并且将该结果应用于LeCun的初始化方法，得到了单大宽度层的超参数要求为N ^ 2的结论。

一层宽层后金字塔拓扑的深度网络全局收敛