Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields better test performance than training with larger batches. The specific noise structure inherent to SGD is known to be responsible for this implicit bias. DP-SGD, used to ensure differential privacy (DP) in DNN training, adds Gaussian noise to the clipped gradients. Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees necessitate the use of massive batches. We first show that the phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity (and not the clipping) is the cause of this implicit bias, even with additional isotropic Gaussian noise. We theoretically analyse the solutions obtained with continuous versions of Noisy-SGD for the Linear Least Squares and Diagonal Linear Network settings, and reveal that the implicit bias is indeed amplified by the additional noise. Thus, the performance issues of large-batch DP-SGD training are rooted in the same underlying principles as SGD, offering hope for potential improvements in large-batch training strategies.
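To make the distinction between the two update rules concrete, the following is a minimal sketch of a single step on a linear least-squares loss, written in plain Python. It is an illustration under stated assumptions, not the implementation used in this work: the function name `noisy_sgd_step`, its parameters, and the convention that `sigma` is the standard deviation of the noise added to the summed gradient are all hypothetical choices. Setting `clip_norm=None` gives a Noisy-SGD-style step, while a finite `clip_norm` gives a DP-SGD-style step with per-example clipping.

```python
import random


def noisy_sgd_step(w, batch, lr, sigma, clip_norm=None, rng=random):
    """One step on the per-example loss 0.5 * (w . x - y)^2.

    clip_norm=None  -> Noisy-SGD-style step (no clipping).
    clip_norm=C > 0 -> DP-SGD-style step (per-example clipping to norm C).
    sigma is the std of the isotropic Gaussian noise added to the summed
    gradient (an assumed convention for this sketch).
    """
    dim = len(w)
    grad_sum = [0.0] * dim
    for x, y in batch:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        g = [residual * xi for xi in x]  # per-example gradient
        if clip_norm is not None:
            norm = sum(gi * gi for gi in g) ** 0.5
            scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
            g = [scale * gi for gi in g]
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
    # Add independent Gaussian noise to each coordinate, then average.
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(batch) for s in grad_sum]
    return [wi - lr * gi for wi, gi in zip(w, noisy_mean)]
```

For example, `noisy_sgd_step(w, batch, lr=0.1, sigma=1.0, rng=random.Random(0))` performs a reproducible Noisy-SGD-style step. Because the noise is divided by the batch size, its contribution to the update shrinks as the batch grows, whereas the stochasticity from sampling the batch is a separate source of randomness.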

Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training