We identify and solve a hidden-layer model that is analytically tractable at any finite width and whose limits exhibit both the kernel phase and the feature learning phase. We analyze the phase diagram of this model in all possible limits of common hyperparameters, including the width, the layer-wise learning rates, the output scale, and the initialization scale. We apply this result to analyze how and when feature learning happens in both infinite-width and finite-width models. We identify three prototype mechanisms of feature learning: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling. In sharp contrast, none of these mechanisms is present when the model is in the kernel regime. This explains why large initialization often leads to worse performance. Lastly, we empirically demonstrate that the findings for this analytical model also appear in nonlinear networks on real tasks.

When Does Feature Learning Happen? A Perspective from an Analytically Solvable Model