This paper studies the generalization characteristics of iterative learning algorithms with bounded updates on non-convex loss functions, using information-theoretic techniques. Our key contribution is a novel generalization error bound for such algorithms, which extends beyond previous works that focused only on Stochastic Gradient Descent (SGD). Our approach introduces two main novelties: 1) we reformulate the mutual information as the uncertainty of updates, providing a new perspective, and 2) instead of using the chain rule of mutual information, we employ a variance decomposition technique to decompose information across iterations, allowing for a simpler surrogate process. We analyze our generalization bound under various settings and demonstrate improved bounds when the model dimension increases at the same rate as the number of training samples. To bridge the gap between theory and practice, we also examine the previously observed scaling behavior in large language models. Ultimately, our work takes a further step toward developing practical generalization theories.

Generalization Error Bounds for Iterative Learning Algorithms with Bounded Updates