We show that the majority of the inference computations for large generative
models such as LLaMA and OPT can be performed with both weights and activations
being cast to 4 bits, in a way that leads to practical speedups while at the
same time maintaining good accuracy. We achieve this via a hybrid quantization
strategy called QUIK, which compresses most of the weights and activations to
4 bits, while keeping some outlier weights and activations in higher precision.
Crucially, our scheme is designed with computational efficiency in mind: we
provide GPU kernels with highly efficient layer-wise runtimes, which lead to
practical end-to-end throughput improvements of up to 3.1x relative to FP16
execution. Code and models are provided at this https URL
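
The hybrid idea described above (most values in 4 bits, a few outliers kept in higher precision) can be illustrated with a minimal NumPy sketch. This is not the QUIK implementation or its GPU kernels: the function names, the rule for picking outlier columns (largest absolute value per column), and the symmetric per-row scaling are all assumptions made for illustration.

```python
import numpy as np

def quantize_with_outliers(w, num_outliers=2, bits=4):
    """Hypothetical helper: quantize most columns to `bits`, keep outliers in FP16."""
    # Assumption: outlier columns are those with the largest absolute entries.
    col_mag = np.abs(w).max(axis=0)
    outlier_cols = np.sort(np.argsort(col_mag)[-num_outliers:])
    base_cols = np.setdiff1d(np.arange(w.shape[1]), outlier_cols)

    qmax = 2 ** (bits - 1) - 1               # 7 for signed 4-bit values
    base = w[:, base_cols]
    # Symmetric per-row scale so the largest base value maps to qmax.
    scale = np.abs(base).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                  # avoid division by zero on all-zero rows
    q = np.clip(np.round(base / scale), -qmax - 1, qmax).astype(np.int8)

    outliers = w[:, outlier_cols].astype(np.float16)
    return q, scale, base_cols, outlier_cols, outliers

def dequantize(q, scale, base_cols, outlier_cols, outliers):
    """Rebuild an approximate full-precision matrix from the two parts."""
    w_hat = np.empty((q.shape[0], len(base_cols) + len(outlier_cols)), dtype=np.float32)
    w_hat[:, base_cols] = q * scale          # 4-bit part, rescaled
    w_hat[:, outlier_cols] = outliers        # higher-precision part, copied back
    return w_hat
```

The 4-bit values are stored in `int8` here only for simplicity; a real kernel would pack two values per byte and run the matrix multiplication directly on the integer data.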

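Most inference computation for large generative models can be accelerated by casting both weights and activations to 4 bits while maintaining good accuracy. This is achieved with QUIK, a hybrid quantization strategy that compresses most weights and activations to 4 bits and keeps some outliers in higher precision. Crucially, the scheme focuses on computational efficiency: it provides efficient layer-wise GPU kernels that improve end-to-end throughput by up to 3.1x relative to FP16 execution.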

Towards End-to-end 4-Bit Inference on Generative Large Language Models

Existing Continual Learning (CL) solutions only partially address the power,
memory, and computation constraints that deep learning models face when
deployed on low-power embedded CPUs. In this paper, we propose a CL solution
that combines recent advances in the CL field with the efficiency of
Binary Neural Networks (BNNs), which use 1 bit for weights and activations to
execute deep learning models efficiently. We propose a hybrid quantization of
CWR* (an effective CL approach) that treats the forward and backward passes
differently, in order to retain more precision during the gradient update step
while minimizing the latency overhead. The choice of a binary network as
backbone is essential to meet the constraints of low-power devices and, to the
best of the authors' knowledge, this is the first attempt to demonstrate
on-device learning with BNNs. The experimental validation confirms the
validity and suitability of the proposed method.
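
To make the efficiency argument for 1-bit weights and activations concrete, the sketch below shows the standard XOR-and-popcount identity for dot products of {-1, +1} vectors. It is a generic illustration of binary arithmetic, not the paper's CWR* quantization scheme; the function names and the convention that zero maps to +1 are assumptions.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1}; zero maps to +1 (a convention assumed here)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def pack_bits(b):
    """Encode a {-1, +1} vector as bits (1 for +1, 0 for -1), packed into bytes."""
    return np.packbits(b > 0)

def binary_dot(a_packed, b_packed, n):
    """Dot product of two length-n {-1, +1} vectors given their packed bits."""
    # XOR marks positions where the signs differ; padding bits are equal, so they cancel.
    diff = np.bitwise_xor(a_packed, b_packed)
    mismatches = int(np.unpackbits(diff).sum())
    # Matching positions contribute +1 and mismatching ones -1.
    return n - 2 * mismatches
```

On hardware with a native popcount instruction, this replaces n multiply-accumulate operations with a handful of bitwise operations per machine word, which is the main reason binary backbones suit low-power CPUs.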

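Existing continual learning solutions only partially address the power, memory, and computation constraints of deploying deep learning models on low-power embedded CPUs. This paper proposes a continual learning solution that combines recent advances in the field with the efficiency of Binary Neural Networks (BNNs), which use 1 bit for weights and activations to execute deep learning models efficiently. It introduces a hybrid quantization of CWR* (an effective continual learning approach) that treats the forward and backward passes differently, retaining more precision in the gradient update step while minimizing latency overhead. Choosing a binary network as the backbone is key to meeting the constraints of low-power devices and, to the best of the authors' knowledge, this is the first attempt to demonstrate on-device learning with BNNs. The experiments confirm the validity and suitability of the proposed method.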