We investigate phase transitions in a Toy Model of Superposition (TMS) using
Singular Learning Theory (SLT). We derive a closed formula for the theoretical
loss and, in the case of two hidden dimensions, discover that regular $k$-gons
are critical points. We present supporting theory indicating that the local
learning coefficient (a geometric invariant) of these $k$-gons determines phase
transitions in the Bayesian posterior as a function of training sample size. We
then show empirically that the same $k$-gon critical points also determine the
behavior of SGD training. The picture that emerges adds evidence to the
conjecture that the SGD learning trajectory is subject to a sequential learning
mechanism. Specifically, we find that the learning process in TMS, be it
through SGD or Bayesian learning, can be characterized by a journey through
parameter space from regions of high loss and low complexity to regions of low
loss and high complexity.
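
To make the regular $k$-gon geometry concrete, here is a minimal sketch. It assumes, beyond what the abstract states, that the model has a weight matrix $W$ with 2 rows (the two hidden dimensions) and $c$ columns (one per input feature), and that the output is $\mathrm{ReLU}(W^\top W x + b)$. The function names, the radius, and the bias are hypothetical placeholders, not the values at the actual critical points.

```python
import numpy as np

def kgon_weights(k, c, radius=1.0):
    """Return a (2, c) matrix whose first k columns are the vertices of a
    regular k-gon of the given radius; the remaining c - k columns are zero.
    The radius is an illustrative placeholder, not a derived critical value."""
    angles = 2.0 * np.pi * np.arange(k) / k
    W = np.zeros((2, c))
    W[0, :k] = radius * np.cos(angles)
    W[1, :k] = radius * np.sin(angles)
    return W

def tms_forward(W, b, x):
    """Assumed toy-model forward pass: ReLU(W^T W x + b)."""
    return np.maximum(W.T @ W @ x + b, 0.0)

# Example: a pentagon embedded among 6 input features.
W = kgon_weights(k=5, c=6)
b = np.zeros(6)              # placeholder bias
x = np.eye(6)[0]             # a one-hot input feature
y = tms_forward(W, b, x)
```

In this configuration the Gram matrix $W^\top W$ depends only on the angles between vertices, which is what makes the $k$-gon a highly symmetric candidate for a critical point.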

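Using singular learning theory, we study phase transitions in a toy model of superposition (TMS). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, find that regular $k$-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show experimentally that these $k$-gon critical points also determine the behavior of SGD training. Taken together, the results support the conjecture that the SGD learning trajectory is governed by a sequential learning mechanism. Specifically, we find that learning in TMS, whether through SGD or Bayesian learning, can be characterized as a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.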

Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition

Current theoretical results on optimization trajectories of neural networks
trained by gradient descent typically have the form of rigorous but potentially
loose bounds on the loss values. In the present work we take a different
approach and show that the learning trajectory can be characterized by an
explicit asymptotic at large training times. Specifically, the leading term in
the asymptotic expansion of the loss behaves as a power law $L(t) \sim
t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the
smoothness of the activation function, and the class of functions being
approximated. Our results are based on spectral analysis of the integral
operator representing the linearized evolution of a large network trained on
the expected loss. Importantly, the techniques we employ do not require a
specific form of the data distribution (for example, Gaussian), thus making our
findings sufficiently universal.
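
A power law $L(t) \sim t^{-\xi}$ appears as a straight line of slope $-\xi$ on a log-log plot, so the exponent can be estimated from a recorded loss curve by linear regression in log coordinates. The sketch below illustrates this on synthetic data. The function name, the prefactor 3.0, and the exponent 0.75 are arbitrary illustrative choices and are not taken from the paper.

```python
import numpy as np

def fit_power_law_exponent(t, loss):
    """Least-squares fit of log(loss) = log(C) - xi * log(t).
    Returns the estimated exponent xi and prefactor C."""
    t = np.asarray(t, dtype=float)
    loss = np.asarray(loss, dtype=float)
    slope, intercept = np.polyfit(np.log(t), np.log(loss), 1)
    return -slope, np.exp(intercept)

# Synthetic loss curve with a known exponent (illustrative values only).
t = np.logspace(2, 5, 50)
loss = 3.0 * t ** -0.75
xi, C = fit_power_law_exponent(t, loss)
```

On real training curves the fit should be restricted to large $t$, since the power law describes only the leading term of the asymptotic expansion.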

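This paper studies the optimization trajectories of neural networks trained by gradient descent and shows that the learning trajectory can be characterized by an explicit asymptotic at large training times.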

Universal scaling laws in the gradient descent training of neural networks

We develop a mathematically rigorous framework for multilayer neural networks
in the mean field regime. As the network's widths increase, the network's
learning trajectory is shown to be well captured by a meaningful and
dynamically nonlinear limit (the \textit{mean field} limit), which is
characterized by a system of ODEs. Our framework applies to a broad range of
network architectures, learning dynamics and network initializations. Central
to the framework is the new idea of a \textit{neuronal embedding}, which
consists of a non-evolving probability space that allows one to embed neural
networks of arbitrary widths.
Using our framework, we prove several properties of large-width multilayer
neural networks. First, we show that independent and identically distributed
initializations cause strong degeneracy effects on the network's learning
trajectory when the network's depth is at least four. Second, we obtain
several global convergence guarantees for feedforward multilayer networks under
a number of different setups. These include two-layer and three-layer networks
with independent and identically distributed initializations, and multilayer
networks of arbitrary depths with a special type of correlated initializations
that is motivated by the new concept of \textit{bidirectional diversity}.
Unlike previous works that rely on convexity, our results admit non-convex
losses and hinge on a certain universal approximation property, which is a
distinctive feature of infinite-width neural networks and is shown to hold
throughout the training process. Aside from being the first known results for
global convergence of multilayer networks in the mean field regime, they
demonstrate flexibility of our framework and incorporate several new ideas and
insights that depart from the conventional convex optimization wisdom.

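This work develops a mathematically rigorous framework for multilayer neural networks, studies their learning trajectories in the mean field regime, and proves several properties of such networks, including global convergence and the effects of initialization. The new concepts it introduces include neuronal embedding and bidirectional diversity.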

A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks

Abstract reasoning is a long-standing challenge in artificial
intelligence. Recent studies suggest that many of the deep architectures that
have triumphed in other domains fail to work well in abstract reasoning. In
this paper, we first illustrate that one of the main challenges in such a
reasoning task is the presence of distracting features, which requires the
learning algorithm to leverage counterevidence and to reject any of the false
hypotheses in order to learn the true patterns. We then show that a carefully
designed learning trajectory over different categories of training data can
effectively boost learning performance by mitigating the impacts of distracting
features. Inspired by this fact, we propose the feature robust abstract reasoning
(FRAR) model, which consists of a reinforcement learning based teacher network
that determines the sequence of training and a student network for predictions.
Experimental results demonstrate strong improvements over baseline algorithms,
and we beat the state-of-the-art models by 18.7% on the RAVEN
dataset and 13.3% on the PGM dataset.

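This paper proposes the feature robust abstract reasoning (FRAR) model, which uses a planned learning trajectory over categories of training data to improve learning performance. It outperforms baseline algorithms and beats the state-of-the-art models by 18.7% on the RAVEN dataset and 13.3% on the PGM dataset.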