We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
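
To make "approximately a rank-two matrix" concrete, the sketch below measures how much of a matrix's Frobenius norm is left after its best rank-two approximation. It relies on the Eckart-Young theorem, which says this residual is the norm of the discarded singular values. The function name `rank_k_relative_error`, the matrix shapes, and the noise level are illustrative assumptions, and the metric is not necessarily the one used in the paper.

```python
import numpy as np


def rank_k_relative_error(W: np.ndarray, k: int = 2) -> float:
    """Relative Frobenius error of the best rank-k approximation of W.

    By the Eckart-Young theorem, the error equals the l2 norm of the
    singular values beyond the k-th, divided by the norm of all of them.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    total = np.sqrt(np.sum(s ** 2))
    if total == 0.0:
        return 0.0  # the zero matrix is trivially rank <= k
    tail = np.sqrt(np.sum(s[k:] ** 2))
    return float(tail / total)


rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained weight matrix:
# an exact rank-two product plus a small perturbation.
U = rng.standard_normal((64, 2))
V = rng.standard_normal((2, 32))
W_low = U @ V + 1e-3 * rng.standard_normal((64, 32))

# Hypothetical stand-in for a matrix with no low-rank structure.
W_dense = rng.standard_normal((64, 32))

err_low = rank_k_relative_error(W_low, k=2)
err_dense = rank_k_relative_error(W_dense, k=2)
print(f"near rank-two matrix: {err_low:.2e}")
print(f"dense random matrix:  {err_dense:.2e}")
```

A value near zero indicates that two singular directions capture almost all of the matrix, while a value near one indicates no rank-two structure. Applying the same function to a layer's weights with and without WD would be one simple way to probe the bias described above.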

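This work addresses the implicit bias towards low-rank weight matrices when training neural networks with Weight Decay (WD). We prove that once a ReLU neural network is sufficiently trained, its weight matrix is approximately a rank-two matrix. Empirically, we show that WD is a necessary condition for inducing this low-rank bias in both regression and classification tasks, and we provide improved generalization error bounds.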

Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks