We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u}_1,\boldsymbol{x}\rangle,\ldots,\langle\boldsymbol{u}_k,\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_k$ of the true model when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $\mathcal{O}(\sqrt{kd/T})$ after $T$ iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + \epsilon$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ sample requirement to learn any degree-$p$ polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
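
The convergence of the first-layer weights to the principal direction can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the link function is taken to be $\tanh$ (one monotonic choice), the second layer is fixed to alternating signs scaled by $1/\sqrt{m}$, biases are omitted, each step uses a fresh minibatch of Gaussian inputs, and all hyperparameters and the function names `principal_alignment` and `run_online_sgd` are hypothetical. It tracks the fraction of $\|W\|_F^2$ that lies along $\boldsymbol{u}$ before and after training.

```python
import numpy as np


def principal_alignment(W, u):
    """Fraction of ||W||_F^2 lying along the unit vector u (a value in [0, 1])."""
    return float(np.sum((W @ u) ** 2) / np.sum(W ** 2))


def run_online_sgd(d=20, m=50, steps=2000, batch=32, lr=0.05,
                   weight_decay=0.1, noise_std=0.1, seed=0):
    """Train the first layer of a two-layer ReLU net on a single-index target.

    Assumptions made for this illustration only: tanh link, fixed second
    layer, no biases, fresh Gaussian minibatch at every step.
    """
    rng = np.random.default_rng(seed)

    # Ground-truth direction u (unit norm).
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)

    # First-layer weights W (m x d); fixed second layer a with alternating signs.
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = np.where(np.arange(m) % 2 == 0, 1.0, -1.0) / np.sqrt(m)

    before = principal_alignment(W, u)
    for _ in range(steps):
        # Fresh samples each step: x is Gaussian, y = tanh(<u, x>) + noise.
        X = rng.standard_normal((batch, d))
        y = np.tanh(X @ u) + noise_std * rng.standard_normal(batch)

        pre = X @ W.T                          # pre-activations, (batch, m)
        resid = np.maximum(pre, 0.0) @ a - y   # prediction error, (batch,)

        # Gradient of 0.5 * mean(resid^2) with respect to W, shape (m, d).
        grad = ((resid[:, None] * (pre > 0)) * a).T @ X / batch

        # SGD step with weight decay on the first layer.
        W -= lr * (grad + weight_decay * W)

    return before, principal_alignment(W, u)


if __name__ == "__main__":
    before, after = run_online_sgd()
    print(f"alignment before training: {before:.3f}")
    print(f"alignment after training:  {after:.3f}")
```

At a random Gaussian initialization the alignment is about $1/d$ on average. In this sketch, weight decay shrinks the components of $W$ orthogonal to $\boldsymbol{u}$, while the expected gradient keeps feeding the component along $\boldsymbol{u}$, so the alignment is expected to rise toward 1.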

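This paper studies the training of a two-layer neural network (NN) of arbitrary width with stochastic gradient descent (SGD), where the input $\boldsymbol{x}$ is Gaussian and the target $y$ follows a multiple-index model. It proves that, when training uses SGD with weight decay, the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_k$ of the true model, which yields a generalization error bound independent of the width of the NN. It further shows that SGD-trained ReLU NNs can learn a single-index target by recovering the principal direction, with a sample complexity linear in $d$, in contrast to the known $d^{\Omega(p)}$ sample requirement for learning any degree-$p$ polynomial in the kernel regime. This shows that NNs trained with SGD can outperform the neural tangent kernel at initialization.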

Neural Networks Efficiently Learn Low-Dimensional Representations with SGD