Neural networks sometimes exhibit grokking, a phenomenon where perfect or
near-perfect performance is achieved on a validation set well after the same
performance has been obtained on the corresponding training set. In this
workshop paper, we introduce a robust technique for measuring grokking, based
on fitting an appropriate functional form. We then use this technique to
investigate the sharpness of transitions in training and validation accuracy
under two settings. The first setting is the theoretical framework developed by
Levi et al. (2023), where closed-form expressions are readily accessible. The second
setting is a two-layer MLP trained to predict the parity of bits, with grokking
induced by the concealment strategy of Miller et al. (2023). We find that
trends between relative grokking gap and grokking sharpness are similar in both
settings when using absolute and relative measures of sharpness. Reflecting on
this, we make progress toward explaining some trends and identify the need for
further study to untangle the various mechanisms which influence the sharpness
of grokking.
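
To make the idea of measuring grokking by fitting a functional form concrete,
here is a minimal Python sketch. It is not the method of the paper. The abstract
does not say which functional form is used, so the sketch assumes a logistic
curve in log training steps. The names `fit_logistic_in_log_steps`,
`logistic_curve`, `relative_gap`, `sharpness_train` and `sharpness_val`, the
synthetic curves, and the definitions of gap and sharpness are all hypothetical
choices made for illustration.

```python
import math


def fit_logistic_in_log_steps(steps, accuracy, eps=1e-9):
    """Fit acc ~ 1 / (1 + exp(-(ln(step) - m) / w)) by linear least squares.

    Assumed functional form (not taken from the paper). Returns (m, w):
    m is the log-step midpoint of the transition, w is its width in
    log-steps (a smaller w means a sharper transition).
    """
    xs, ys = [], []
    for s, a in zip(steps, accuracy):
        a = min(max(a, eps), 1.0 - eps)      # clip so the logit stays finite
        xs.append(math.log(s))
        ys.append(math.log(a / (1.0 - a)))   # logit(acc) is linear in ln(step)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # logit = (x - m) / w, so w = 1 / slope and m = -intercept / slope
    return -intercept / slope, 1.0 / slope


def logistic_curve(steps, m, w):
    """Generate a synthetic accuracy curve from the assumed form."""
    return [1.0 / (1.0 + math.exp(-(math.log(s) - m) / w)) for s in steps]


# Synthetic curves: training accuracy rises near step 50,
# validation accuracy rises much later, near step 2000.
steps = [2 ** k for k in range(1, 15)]
train_acc = logistic_curve(steps, math.log(50.0), 0.5)
val_acc = logistic_curve(steps, math.log(2000.0), 0.8)

m_train, w_train = fit_logistic_in_log_steps(steps, train_acc)
m_val, w_val = fit_logistic_in_log_steps(steps, val_acc)

# Hypothetical summary statistics derived from the fits.
relative_gap = math.exp(m_val - m_train)   # ratio of midpoint steps
sharpness_train = 1.0 / w_train            # inverse transition width
sharpness_val = 1.0 / w_val
```

Fitting the logit with a straight line keeps the sketch dependency-free, but it
weights saturated points heavily; a nonlinear least-squares fit on the raw
accuracies would likely be more robust on real, noisy training curves.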

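Summary: Neural networks can exhibit a phenomenon known as grokking. This paper proposes a robust technique for measuring grokking, based on fitting an appropriate functional form, and uses it to study the sharpness of transitions in training and validation accuracy. It finds that the trends relating grokking gap to sharpness are similar under absolute and relative measures of sharpness.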

Measuring Sharpness in Grokking

We present a smoothly broken power law functional form, which we call a
Broken Neural Scaling Law (BNSL), that accurately models and extrapolates the
scaling behaviors of deep neural networks (i.e., how the evaluation metric of
interest varies as the amount of compute used for training, number of model
parameters, training dataset size, model input size, number of training steps,
or upstream performance varies) for various architectures and for each of
various tasks within a large and diverse set of upstream and downstream tasks,
in zero-shot, prompted, and fine-tuned settings. This set includes large-scale
vision, language, audio, video, diffusion, generative modeling, multimodal
learning, contrastive learning, AI alignment, robotics, out-of-distribution
(OOD) generalization, continual learning, transfer learning, uncertainty
estimation / calibration, out-of-distribution detection, adversarial
robustness, distillation, sparsity, retrieval, quantization, pruning,
molecules, computer programming/coding, math word problems, "emergent" "phase
transitions / changes", arithmetic, unsupervised/self-supervised learning, and
reinforcement learning (single agent and multi-agent). When compared to other
functional forms for neural scaling behavior, this functional form yields
extrapolations of scaling behavior that are considerably more accurate on this
set. Moreover, this functional form accurately models and extrapolates scaling
behavior that other functional forms are incapable of expressing, such as the
non-monotonic transitions present in the scaling behavior of phenomena such as
double descent and the delayed, sharp inflection points present in the scaling
behavior of tasks such as arithmetic. Lastly, we use this functional form to
glean insights about the limit of the predictability of scaling behavior. Code
is available at this https URL
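
As an illustration of what a smoothly broken power law looks like in practice,
here is a minimal Python sketch. The abstract does not state the equation, so
the parameterization below (an offset `a`, a scale `b`, an initial exponent
`c0`, and one `(c_i, d_i, f_i)` triple per break) is an assumption on our part
and should be checked against the paper. The function name `bnsl` is
hypothetical.

```python
import math


def bnsl(x, a, b, c0, breaks):
    """Evaluate an assumed smoothly broken power law at x > 0.

    y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    breaks: list of (c_i, d_i, f_i) tuples, one per break, where d_i is the
    location of the break, c_i the change in log-log slope across it, and
    f_i > 0 its smoothness (smaller f_i gives a sharper break).
    """
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y


# With no breaks the form reduces to an ordinary power law plus an offset.
plain = bnsl(4.0, a=1.0, b=2.0, c0=0.5, breaks=[])

# Far beyond a single break at d = 10, the log-log slope approaches -(c0 + c1).
one_break = [(1.0, 10.0, 0.5)]
y_lo = bnsl(1e8, a=0.0, b=1.0, c0=0.5, breaks=one_break)
y_hi = bnsl(1e9, a=0.0, b=1.0, c0=0.5, breaks=one_break)
asymptotic_slope = math.log10(y_hi / y_lo)
```

Each break multiplies the curve by a factor that is close to 1 well below `d_i`
and behaves like a pure power law well above it, so the curve bends smoothly
from one slope to another. A negative `c_i` makes the slope shallower or
reverses it, which is how a form of this kind can express non-monotonic
behavior.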

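Summary: This work studies the scaling behavior of neural networks across a wide range of tasks and proposes a smoothly broken power law functional form called BNSL. Compared with other functional forms for neural scaling behavior, its extrapolations are more accurate, and it accurately models and extrapolates non-monotonic transitions and sharp inflection points that other functional forms cannot express.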