Sep 2019
Student Specialization in Deep ReLU Networks with Finite Width and Input Dimension
Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network
Yuandong Tian
TL;DR
This paper studies ReLU / Leaky ReLU networks trained by gradient descent and shows how nodes specialize in two-layer and multi-layer networks. It proves that, under suitable conditions on the dataset and on the pair of networks, the results accommodate certain forms of data augmentation and hold for sample sets of fixed size, and it characterizes the minimal divergence between neuron nodes, the smallest gradient magnitude required, and the inductive bias that emerges during training.
Abstract
To analyze deep ReLU networks, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth, with stochastic gradient descent (SGD).
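
As a concrete illustration of this setting, the sketch below trains an over-parameterized student ReLU network with SGD to match the outputs of a frozen teacher of the same depth. It is a minimal PyTorch sketch, not code from the paper: the Gaussian input distribution, layer widths, learning rate, and step count are illustrative assumptions.

import torch
import torch.nn as nn


def mlp(widths):
    # Deep ReLU network: Linear layers with ReLU between them,
    # no ReLU after the final (output) layer.
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])


torch.manual_seed(0)

# Teacher and student share the same depth; the student is
# over-parameterized (wider hidden layers). The teacher stays fixed.
teacher = mlp([20, 10, 10, 1])   # assumed widths, for illustration only
student = mlp([20, 50, 50, 1])
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(5000):
    x = torch.randn(64, 20)                 # fresh mini-batch of inputs
    loss = loss_fn(student(x), teacher(x))  # regress onto the teacher's output
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step}: loss {loss.item():.6f}")

After training, the paper's question can be probed empirically, e.g. by comparing each student neuron's first-layer weight vector against the teacher's and checking which teacher node it aligns with.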