Recent research, such as BitNet, is paving the way for a new era of 1-bit
Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant,
namely BitNet b1.58, in which every single parameter (or weight) of the LLM is
ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16)
Transformer LLM with the same model size and training tokens in terms of both
perplexity and end-task performance, while being significantly more
cost-effective in terms of latency, memory, throughput, and energy consumption.
More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for
training new generations of LLMs that are both high-performance and
cost-effective. Furthermore, it enables a new computation paradigm and opens
the door for designing specific hardware optimized for 1-bit LLMs.
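
Two ideas in this abstract can be made concrete with a short sketch (an illustration, not code from the paper). A weight with three possible values carries log2(3), or about 1.58, bits, which is where the name comes from. A matrix-vector product with weights in {-1, 0, 1} also needs no multiplications. The function name `ternary_matvec` and the example values are hypothetical.

```python
import math

# A weight with three possible values carries log2(3) bits of information.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585


def ternary_matvec(weights, x):
    """Multiply a ternary matrix by a vector using only add/subtract.

    `weights` is a list of rows whose entries are -1, 0, or 1
    (hypothetical helper, for illustration only).
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: the activation is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.0]
y = ternary_matvec(W, x)  # [1.5, 0.5]
```

The inner loop contains only additions and subtractions, which is one plausible reading of the "new computation paradigm" the abstract mentions.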

1-bit Large Language Models (LLMs) with ternary weights, such as BitNet b1.58, define a new scaling law and a high-performance, cost-effective recipe for training new generations of LLMs, and they open the door to hardware optimized for 1-bit LLMs.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Recent breakthroughs in computer vision make use of large deep neural
networks, utilizing the substantial speedup offered by GPUs. For applications
running on limited hardware, however, high precision real-time processing can
still be a challenge. One approach to solving this problem is training networks
with binary or ternary weights, thus removing the need to calculate
multiplications and significantly reducing memory size. In this work, we
introduce LR-nets (Local reparameterization networks), a new method for
training neural networks with discrete weights using stochastic parameters. We
show how a simple modification to the local reparameterization trick,
previously used to train Gaussian distributed weights, enables the training of
discrete weights. Using the proposed training method, we test both binary and ternary
models on MNIST, CIFAR-10 and ImageNet benchmarks and reach state-of-the-art
results on most experiments.
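
As a rough illustration of the idea (a sketch based on a common reading of the local reparameterization trick, not the authors' code): rather than sampling each discrete weight, one samples the pre-activation. The pre-activation is a sum of many independent terms, so it is approximated here as Gaussian. The function name, the per-weight probability layout, and the Gaussian approximation are assumptions that the abstract does not spell out.

```python
import math
import random


def sample_preactivation(probs, x, rng):
    """Sample z = sum_i w_i * x_i for independent ternary weights w_i.

    probs[i] = (p_minus, p_zero, p_plus): assumed distribution of w_i
    over {-1, 0, 1}. Rather than sampling every weight, approximate z
    as Gaussian with the exact mean and variance of the sum.
    """
    mean = 0.0
    var = 0.0
    for (p_minus, p_zero, p_plus), xi in zip(probs, x):
        w_mean = p_plus - p_minus                 # E[w_i]
        w_var = (p_plus + p_minus) - w_mean ** 2  # E[w_i^2] - E[w_i]^2
        mean += w_mean * xi
        var += w_var * xi ** 2
    eps = rng.gauss(0.0, 1.0)  # noise drawn independently of the parameters
    return mean + math.sqrt(var) * eps, mean, var


rng = random.Random(0)
probs = [(0.1, 0.2, 0.7), (0.6, 0.3, 0.1), (0.25, 0.5, 0.25)]
x = [1.0, -2.0, 0.5]
z, mu, sigma2 = sample_preactivation(probs, x, rng)
```

For a fixed noise draw, `z` is a smooth function of the weight probabilities. This is what would let gradients reach the distribution parameters in an automatic-differentiation framework.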

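This work presents LR-nets (local reparameterization networks), a method for training neural networks with discrete weights through a simple modification to the local reparameterization trick. Tests on MNIST, CIFAR-10, and ImageNet show that binary and ternary models with discrete weights reach state-of-the-art results in most experiments.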

Learning Discrete Weights Using the Local Reparameterization Trick

We propose a novel fine-grained quantization (FGQ) method to ternarize
pre-trained full precision models, while also constraining activations to 8 and
4-bits. Using this method, we demonstrate a minimal loss in classification
accuracy on state-of-the-art topologies without additional training. We provide
an improved theoretical formulation that forms the basis for a higher quality
solution using FGQ. Our method involves ternarizing the original weight tensor
in groups of $N$ weights. Using $N=4$, we achieve Top-1 accuracy within $3.7\%$
and $4.2\%$ of the baseline full precision result for Resnet-101 and Resnet-50
respectively, while eliminating $75\%$ of all multiplications. These results
enable a full 8/4-bit inference pipeline, with best-reported accuracy using
ternary weights on the ImageNet dataset, with a potential $9\times$ improvement
in performance. Also, for smaller networks like AlexNet, FGQ achieves
state-of-the-art results. We further study the impact of group size on both
performance and accuracy. With a group size of $N=64$, we eliminate
$\approx99\%$ of the multiplications; however, this introduces a noticeable
drop in accuracy, which necessitates fine tuning the parameters at lower
precision. We address this by fine-tuning Resnet-50 with 8-bit activations and
ternary weights at $N=64$, improving the Top-1 accuracy to within $4\%$ of the
full precision result with $<30\%$ additional training overhead. Our final
quantized model can run on a full 8-bit compute pipeline using 2-bit weights
and has the potential of up to $15\times$ improvement in performance compared
to baseline full-precision models.
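
The group-wise ternarization described above can be sketched as follows. This is a minimal illustration, not the paper's formulation. The threshold rule (0.7 times the mean absolute value) is a heuristic borrowed from ternary weight networks, and the names `ternarize_group` and `ternarize_in_groups` are hypothetical. Each group of $N$ weights gets its own scale, so a dot product needs one multiplication per group rather than one per weight.

```python
def ternarize_group(group):
    """Return (alpha, codes) with codes in {-1, 0, 1} and group ~ alpha * codes.

    Threshold heuristic (assumed, not taken from the abstract):
    delta = 0.7 * mean(|w|); alpha = mean of |w| over weights above delta.
    """
    delta = 0.7 * sum(abs(w) for w in group) / len(group)
    codes = [(1 if w > delta else -1 if w < -delta else 0) for w in group]
    kept = [abs(w) for w, c in zip(group, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, codes


def ternarize_in_groups(weights, n):
    """Split a flat weight list into groups of n and ternarize each group."""
    return [ternarize_group(weights[i:i + n]) for i in range(0, len(weights), n)]


weights = [0.9, -0.1, 0.05, -1.1, 0.02, 0.4, -0.5, 0.03]
quantized = ternarize_in_groups(weights, n=4)

# One scale multiply per group of n weights instead of n multiplies.
fraction_removed = 1 - 1 / 4  # 0.75 for n = 4
```

Under this reading, $N=4$ removes $1 - 1/4 = 75\%$ of the multiplications, which agrees with the figure quoted in the abstract.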

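This paper proposes fine-grained quantization (FGQ), a method that ternarizes pre-trained full-precision models while constraining activations to 8 and 4 bits. With this method, we demonstrate minimal loss in classification accuracy on state-of-the-art topologies without additional training. Applied to models such as Resnet-101 and Resnet-50, it eliminates 75% of multiplications, enabling a full 8/4-bit inference pipeline with the best reported accuracy on the ImageNet dataset and a potential performance improvement of up to 9x. The final quantized model has the potential for up to 15x better performance than the full-precision model.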

Ternary Neural Networks with Fine-Grained Quantization

We propose a cluster-based quantization method to convert pre-trained full
precision weights into ternary weights with minimal impact on the accuracy. In
addition, we also constrain the activations to 8 bits, thus enabling a sub-8-bit
full-integer inference pipeline. Our method uses smaller clusters of N filters
with a common scaling factor to minimize the quantization loss, while also
maximizing the number of ternary operations. We show that with a cluster size
of N=4 on Resnet-101, we can achieve 71.8% TOP-1 accuracy, within 6% of the best
full precision results while replacing ~85% of all multiplications with 8-bit
accumulations. Using the same method with 4-bit weights achieves 76.3% TOP-1
accuracy, which is within 2% of the full precision result. We also study the
impact of the cluster size on both performance and accuracy: a larger cluster
size of N=64 can replace ~98% of the multiplications with ternary operations but
introduces a significant drop in accuracy, which necessitates fine-tuning the
parameters by retraining the network at lower precision. To address this, we
have also trained a low-precision Resnet-50 with 8-bit activations and ternary
weights by pre-initializing the network with full precision weights, achieving
68.9% TOP-1 accuracy within 4 additional epochs. Our final quantized model can
run on a full 8-bit compute pipeline, with a potential 16x improvement in
performance compared to baseline full-precision models.
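
A minimal sketch of what an integer pipeline with 8-bit activations and ternary weights might look like. It is an illustration under assumptions, not the authors' implementation: the symmetric per-tensor activation quantizer, the helper names, and the example values are all hypothetical.

```python
def quantize_activations(x, num_bits=8):
    """Symmetric quantization of activations to signed integers (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in x)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return q, scale


def ternary_int_dot(codes, alpha, q_x, x_scale):
    """Dot product of ternary codes with quantized activations.

    The loop only adds or subtracts integers; the two floating-point
    scales (alpha for the weights, x_scale for the activations) are
    applied once at the end.
    """
    acc = 0  # integer accumulator
    for c, q in zip(codes, q_x):
        if c == 1:
            acc += q
        elif c == -1:
            acc -= q
    return alpha * x_scale * acc


x = [0.5, -1.27, 0.0, 1.0]
q_x, x_scale = quantize_activations(x)      # x_scale is about 0.01 here
codes, alpha = [1, -1, 0, 1], 0.8           # one shared scale for the cluster
y = ternary_int_dot(codes, alpha, q_x, x_scale)
```

Because a cluster shares one scaling factor, the scale is applied once after the accumulation, and the multiplications inside the loop become integer additions and subtractions.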

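This work proposes a cluster-based quantization method that converts pre-trained full-precision weights into ternary weights and constrains activations to 8 bits, enabling a sub-8-bit full-integer inference pipeline. The method uses small clusters of N filters with a common scaling factor to minimize quantization loss while maximizing the number of ternary operations. With a cluster size of N=4 on ResNet-101, it achieves 71.8% TOP-1 accuracy while replacing 85% of all multiplications with 8-bit accumulations. The same method with 4-bit weights achieves 76.3%, within 2% of the full-precision result. The work also studies the effect of cluster size on performance and accuracy: a larger cluster size of N=64 replaces 98% of the multiplications with ternary operations but significantly reduces accuracy, which requires fine-tuning the parameters by retraining the network at lower precision. To address this, we also train a low-precision ResNet-50 with 8-bit activations and ternary weights, pre-initialized with full-precision weights, reaching 68.9% TOP-1 accuracy within 4 additional epochs. The final quantized model can run on a full 8-bit compute pipeline, with a potential 16x performance improvement over baseline full-precision models.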