The interest in linear complexity models for large language models is on the
rise, although their scaling capacity remains uncertain. In this study, we
present the scaling laws for linear complexity language models to establish a
foundation for their scalability. Specifically, we examine the scaling
behaviors of three efficient linear architectures. These include TNL, a linear
attention model with data-independent decay; HGRN2, a linear RNN with
data-dependent decay; and cosFormer2, a linear attention model without decay.
For comparison, we include LLaMA as a softmax-attention baseline. Each
architecture was trained in six variants, ranging from 70M to 7B parameters, on
a 300B-token corpus and evaluated across a total of 1,376 intermediate
checkpoints on various downstream tasks. The evaluation covers validation loss,
commonsense reasoning, and information retrieval and generation. The study
reveals that existing linear complexity language models exhibit scaling
capabilities similar to those of conventional transformer-based models while
also demonstrating superior linguistic proficiency and knowledge retention.
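
The abstract does not give the functional form of the fitted scaling laws. A
common choice is a power law in parameter count, loss ≈ a * N^(-alpha), and the
sketch below illustrates fitting one under that assumption by least squares in
log-log space. The function names, the intermediate model sizes, and the
synthetic loss values are hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(n_params, losses):
    """Fit loss ~= a * N**(-alpha) by ordinary least squares in log-log space.

    Returns (a, alpha). Assumes all inputs are strictly positive.
    """
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, alpha, n_params):
    """Evaluate the fitted power law at a given parameter count."""
    return a * n_params ** (-alpha)


# Hypothetical sizes: only the 70M and 7B endpoints come from the abstract.
sizes = [70e6, 160e6, 410e6, 1e9, 3e9, 7e9]

# Synthetic, noise-free losses generated from a known power law (assumption),
# so the fit should recover the generating constants.
true_a, true_alpha = 20.0, 0.1
synthetic_losses = [true_a * n ** (-true_alpha) for n in sizes]

a, alpha = fit_power_law(sizes, synthetic_losses)
print(f"a = {a:.4f}, alpha = {alpha:.4f}")
```

With real checkpoints the losses are noisy, so the fit would typically be run
per architecture and compared across them; an irreducible-loss offset term is
often added, which this sketch omits.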

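This study establishes a foundation for the scalability of linear complexity language models by analyzing the scaling behavior of three efficient linear architectures. The results show that existing linear complexity language models scale similarly to conventional transformer-based models while demonstrating superior linguistic proficiency and knowledge retention.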

Scaling Laws for Linear Complexity Language Models

Large language models (LLMs) have demonstrated remarkable performance on a
variety of natural language tasks based on just a few examples of natural
language instructions, reducing the need for extensive feature engineering.
However, most powerful LLMs are closed-source or limited in their capability
for languages other than English. In this technical report, we present Baichuan
2, a series of large-scale multilingual language models containing 7 billion
and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on
public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan
2 excels in vertical domains such as medicine and law. We will release all
pre-training model checkpoints to benefit the research community in better
understanding the training dynamics of Baichuan 2.

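Baichuan 2 is a series of large-scale multilingual language models with 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval, and it also excels in vertical domains such as medicine and law. All pre-training model checkpoints will be released so that the research community can better understand the training dynamics of Baichuan 2.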

Baichuan 2: Open Large-scale Language Models

Recent advances in deep learning optimization showed that, with some
a-posteriori information on fully-trained models, it is possible to match
their performance by training only a subset of their parameters. Such a
discovery has a broad impact from theory to applications, driving research
towards methods that identify the minimum subset of parameters to train without
exploiting look-ahead information. However, the methods proposed do not match
the state-of-the-art performance, and rely on unstructured, sparsely connected
models. In this work we shift our focus from individual parameters to the
behavior of the whole neuron, exploiting the concept of neuronal equilibrium
(NEq). When a neuron is in a configuration at equilibrium (meaning that it has
learned a specific input-output relationship), we can halt its update; on the
contrary, when a neuron is at non-equilibrium, we let its state evolve towards
an equilibrium state, updating its parameters. The proposed approach has been
tested on different state-of-the-art learning strategies and tasks, validating
NEq and observing that the neuronal equilibrium depends on the specific
learning setup.
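
The abstract describes halting updates for neurons at equilibrium but does not
spell out the equilibrium test. The sketch below is a minimal illustration
under an assumed criterion: a neuron counts as being at equilibrium when its
outputs on a fixed probe set barely change direction between two consecutive
checkpoints, measured by cosine similarity. The function names, the threshold
`eps`, and the toy data are hypothetical and are not the paper's definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def neurons_to_update(prev_outputs, curr_outputs, eps=1e-3):
    """Return one boolean per neuron: True means keep training it.

    prev_outputs[i] and curr_outputs[i] hold neuron i's outputs on the same
    probe inputs at two consecutive checkpoints. A neuron whose outputs changed
    by at most eps (in 1 - cosine similarity) is treated as being at
    equilibrium and is frozen (assumed criterion).
    """
    mask = []
    for prev, curr in zip(prev_outputs, curr_outputs):
        change = 1.0 - cosine_similarity(prev, curr)
        mask.append(change > eps)
    return mask


# Toy data: neuron 0 is unchanged, neuron 1 changed substantially.
prev = [[0.2, 0.5, 0.1], [1.0, 0.0, -1.0]]
curr = [[0.2, 0.5, 0.1], [0.0, 1.0, 0.5]]

mask = neurons_to_update(prev, curr)
print(mask)  # neuron 0 is frozen, neuron 1 keeps training
```

In a training loop, such a mask would be recomputed periodically so that a
frozen neuron can resume updating if it later leaves equilibrium.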

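By exploiting the concept of neuronal equilibrium, this work shifts the focus from individual parameters to the behavior of the whole neuron when deciding which parameters to train. It tests the approach on different learning strategies and tasks, validates neuronal equilibrium, and observes that it depends on the specific learning setup, with the aim of reaching performance on par with existing techniques.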

To update or not to update? Neurons at equilibrium in deep models

Finding parameters in a deep neural network (NN) that fit training data is a
nonconvex optimization problem, but a basic first-order optimization method
(gradient descent) finds a global optimizer with perfect fit (zero-loss) in
many practical situations. We examine this phenomenon for the case of Residual
Neural Networks (ResNet) with smooth activation functions in a limiting regime
in which both the number of layers (depth) and the number of weights in each
layer (width) go to infinity. First, we use a mean-field-limit argument to
prove that the gradient descent for parameter training becomes a gradient flow
for a probability distribution that is characterized by a partial differential
equation (PDE) in the large-NN limit. Next, we show that under certain
assumptions, the solution to the PDE converges in the training time to a
zero-loss solution. Together, these results suggest that the training of the
ResNet gives a near-zero loss if the ResNet is large enough. We give estimates
of the depth and width needed to reduce the loss below a given threshold, with
high probability.
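
The mean-field argument summarized above can be sketched in generic notation.
The abstract gives no formulas, so every symbol below is an assumption chosen
for illustration: $z_l$ is the hidden state at layer $l$, $f$ the residual
block, $\theta_{l,m}$ the $m$-th weight of layer $l$, $\rho$ the limiting
parameter distribution, $t \in [0,1]$ the continuous depth, $s$ the training
time, and $E$ the loss functional.

```latex
% Discrete ResNet with L layers and M weights per layer (assumed scaling)
z_{l+1}(x) = z_l(x) + \frac{1}{ML} \sum_{m=1}^{M} f\bigl(z_l(x), \theta_{l,m}\bigr),
\qquad l = 0, \dots, L-1 .

% Large-NN limit (M, L \to \infty): the state follows an ODE in depth t,
% driven by a probability distribution \rho over parameters
\frac{\mathrm{d} z(t; x)}{\mathrm{d} t}
  = \int f\bigl(z(t; x), \theta\bigr)\, \rho(\theta, t)\, \mathrm{d}\theta ,
\qquad t \in [0, 1] .

% Gradient descent on the weights becomes a gradient flow of E over \rho,
% written as a PDE in training time s
\partial_s \rho(\theta, t, s)
  = \nabla_\theta \cdot \Bigl( \rho(\theta, t, s)\,
      \nabla_\theta \frac{\delta E}{\delta \rho}(\theta, t, s) \Bigr) .
```

In this picture, the convergence result says that $E(\rho(\cdot, \cdot, s))$
tends to zero as $s \to \infty$ under the paper's assumptions, and the depth and
width estimates quantify how large $L$ and $M$ must be for the finite network
to track this limit.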

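This work studies the equivalence between gradient descent and convex optimization in Residual Neural Networks in the limit of infinite depth and infinite width, and concludes that training a ResNet yields a near-zero-loss solution when the network is large enough.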