This work presents constrained parameter regularization (CPR), an alternative to traditional weight decay. Instead of applying a constant penalty uniformly to all parameters, we enforce an upper bound on a statistical measure (e.g., the L$_2$-norm) of individual parameter groups. This reformulates learning as a constrained optimization problem, which we solve with an adaptation of the augmented Lagrangian method. Our approach allows for varying regularization strengths across different parameter groups, removing the need for explicit penalty coefficients in the regularization terms. CPR requires only two hyperparameters and introduces no measurable runtime overhead. We offer empirical evidence of CPR's effectiveness through experiments on the "grokking" phenomenon, image classification, and language modeling. Our findings show that CPR can counteract the effects of grokking, and it consistently matches or surpasses the performance of traditional weight decay.
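The abstract does not spell out the update rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a textbook multiplier update $\lambda \leftarrow \max(0, \lambda + \mu\,(R(\theta) - \kappa))$ for a single parameter group, with $R$ taken to be the L$_2$-norm. The names `cpr_step`, `run_demo`, `kappa`, `mu`, and `lr`, as well as the toy quadratic loss, are hypothetical and chosen purely for illustration.

```python
import math


def l2_norm(group):
    """L2 norm of one parameter group, given as a list of floats."""
    return math.sqrt(sum(w * w for w in group))


def cpr_step(group, grads, lam, kappa, lr=0.1, mu=1.0):
    """One constrained update for a single parameter group (illustrative only).

    group : current weights of the group
    grads : gradient of the task loss w.r.t. those weights
    lam   : current Lagrange multiplier of this group (assumed update rule)
    kappa : upper bound on the group's L2 norm
    Returns (new_group, new_lam).
    """
    norm = l2_norm(group)
    # Multiplier grows while the bound is violated and is clipped at zero,
    # so a group that stays inside its bound receives no penalty at all.
    lam = max(0.0, lam + mu * (norm - kappa))
    # Gradient of the L2 norm is w / ||w||; use zero at the origin.
    scale = lam / norm if norm > 0.0 else 0.0
    new_group = [w - lr * (g + scale * w) for w, g in zip(group, grads)]
    return new_group, lam


def run_demo(steps=2000, kappa=1.0):
    """Toy problem: pull weights toward a target that lies outside the bound."""
    target = [3.0, 4.0]   # unconstrained optimum, L2 norm 5
    group = [0.3, 0.4]    # initial weights
    lam = 0.0
    for _ in range(steps):
        # Gradient of the toy loss 0.5 * sum((w - t)^2)
        grads = [w - t for w, t in zip(group, target)]
        group, lam = cpr_step(group, grads, lam, kappa)
    return group, lam


if __name__ == "__main__":
    group, lam = run_demo()
    print("final norm:", l2_norm(group), "multiplier:", lam)
```

In this toy setting the multiplier takes the place of a hand-tuned weight decay coefficient: it is driven by how far the group's norm sits above `kappa`, so each group would end up with its own effective regularization strength.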

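This study proposes constrained parameter regularization (CPR). In contrast to traditional weight decay, CPR enforces an upper bound on a statistical measure (e.g., the L$_2$-norm) of individual parameter groups, which avoids explicit scalar penalty coefficients during learning. The resulting constrained optimization problem is solved with the augmented Lagrangian method, which lets CPR apply different regularization strengths to different parameter groups with no noticeable runtime overhead. Experiments on the grokking phenomenon, image classification, and language modeling demonstrate CPR's effectiveness: it is particularly effective at counteracting grokking, and it consistently matches or surpasses the performance of traditional weight decay.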

A New Perspective on Parameter Regularization: A Constrained Approach