This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterize the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only the bias is allowed to be updated by gradient descent under our setting but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.

该研究通过神经切向核（NTK）模式下的梯度下降探讨了训练一层过度参数化的ReLU网络，其中网络的偏置被初始化为某个常量而不是零。该初始化的诱人好处是神经网络将可以在整个训练过程中保持稀疏激活，从而实现快速训练。结果表明，在稀疏化后，网络可以实现与密集网络一样快的收敛速度。其次，提供了宽度稀疏性的相关性，给出了一个稀疏性相关的Rademacher复杂度和泛化性能界限。最后，研究了极限NTK的最小特征值，发现可以使用可训练偏置来提高推广性。

大偏差下宽神经网络的收敛性和泛化性