Despite the huge success of deep learning, our understanding of how non-convex neural networks are trained remains rather limited. Most existing theoretical works only tackle neural networks with one hidden layer, and little is known about multi-layer neural networks. Recurrent neural networks (RNNs) are special multi-layer networks extensively used in natural language processing applications. They are particularly hard to analyze, compared to feedforward networks, because the weight parameters are reused across the entire time horizon. We provide arguably the first theoretical understanding of the convergence speed of training RNNs. Specifically, when the number of neurons is sufficiently large (meaning polynomial in the training data size and in the time horizon) and when the weights are randomly initialized, we show that gradient descent and stochastic gradient descent both minimize the training loss at a linear convergence rate, that is, $\varepsilon \propto e^{-\Omega(T)}$.
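
To unpack the stated rate, here is a minimal sketch of one common reading of linear convergence. It assumes that $T$ counts training iterations, which the abstract does not state explicitly; the loss notation $L_t$ and the contraction constant $c$ are illustrative and not taken from the abstract.

$$
L_{t+1} \le (1 - c)\, L_t \quad \text{for some } 0 < c < 1
\qquad \Longrightarrow \qquad
L_T \le (1 - c)^T L_0 \le e^{-cT} L_0 .
$$

Under this reading, $T \ge \frac{1}{c} \ln \frac{L_0}{\varepsilon}$ iterations suffice to bring the training loss below $\varepsilon$, so the iteration count grows only logarithmically in $1/\varepsilon$.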

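This paper studies how, when training multi-layer neural networks, local-search-type methods such as stochastic gradient descent avoid getting stuck in bad local minima, and how they can fit random labels despite the non-convex, non-smooth structure. It also studies how the ReLU activation function avoids exponential gradient explosion or vanishing in neural networks. Finally, it builds a perturbation theory that can be used to analyze first-order approximations of multi-layer networks with ReLU activations.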

On the Convergence Rate of Training Recurrent Neural Networks