Deep-learning accelerators are increasingly in demand; however, their
performance is constrained by the size of the feature map, leading to high
bandwidth requirements and large buffer sizes. We propose an adaptive scale
feature map compression technique leveraging the unique properties of the
feature map. This technique adopts independent channel indexing given the weak
channel correlation and utilizes a cubical-like block shape to benefit from
strong local correlations. The method further optimizes compression using a
switchable endpoint mode and adaptive scale interpolation to handle unimodal
data distributions, both with and without outliers. This results in 4$\times$
and up to 7.69$\times$ compression rates for 16-bit data in constant and
variable bitrates, respectively. Our hardware design minimizes area cost by
adjusting interpolation scales, which facilitates hardware sharing among
interpolation points. Additionally, we introduce a threshold concept for
straightforward interpolation, avoiding the need for intricate hardware. The
TSMC 28nm implementation has an equivalent gate count of 6135 for the
8-bit version. Furthermore, the hardware architecture scales effectively, with
only a sublinear increase in area cost: a 32$\times$ throughput increase,
enough to meet the theoretical bandwidth of DDR5-6400, costs just
7.65$\times$ the hardware.
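
The endpoint-plus-interpolation idea can be sketched in a few lines. The following is a minimal illustration, not the ASC codec itself: it assumes a block is stored as two endpoints (the block minimum and maximum) plus one small index per value, and it decodes by uniform linear interpolation. The function names, the 3-bit index width, and the 64-value block are assumptions made for the example; the switchable endpoint mode, the adaptive interpolation scales, and the variable-bitrate path described in the abstract are not modeled.

```python
def encode_block(values, index_bits=3):
    """Encode a block as (lo, hi, indices): two endpoints plus one small index per value."""
    lo, hi = min(values), max(values)
    levels = (1 << index_bits) - 1  # number of interpolation steps between the endpoints
    if hi == lo:
        # Flat block: every value equals the endpoint, so all indices are zero.
        return lo, hi, [0] * len(values)
    indices = [round((v - lo) * levels / (hi - lo)) for v in values]
    return lo, hi, indices


def decode_block(lo, hi, indices, index_bits=3):
    """Reconstruct each value by linear interpolation between the two endpoints."""
    levels = (1 << index_bits) - 1
    return [lo + round((hi - lo) * i / levels) for i in indices]


def compression_ratio(block_size, value_bits=16, index_bits=3):
    """Raw bits divided by coded bits for this toy fixed-rate layout."""
    raw_bits = block_size * value_bits
    coded_bits = 2 * value_bits + block_size * index_bits
    return raw_bits / coded_bits


if __name__ == "__main__":
    block = [100, 120, 130, 4000, 90, 110, 105, 95]  # one large value among small ones
    lo, hi, idx = encode_block(block)
    print("indices:", idx)
    print("decoded:", decode_block(lo, hi, idx))
    print("ratio for a 64-value block:", compression_ratio(64))
```

In a sketch like this, a single large value in a block stretches the endpoint range and coarsens every interpolation step, which is the kind of outlier case the abstract says the switchable endpoint mode and adaptive scale interpolation are meant to handle.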

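The performance of deep-learning accelerators is constrained by feature map size. This work proposes an adaptive scale feature map compression technique that exploits the properties of feature maps: it adopts independent channel indexing and a block shape suited to local correlation, and optimizes compression with a switchable endpoint mode and adaptive scale interpolation. The hardware design minimizes area cost by adjusting interpolation scales to enable hardware sharing, and a 32$\times$ throughput increase, which meets the theoretical bandwidth of DDR5-6400, comes at only 7.65$\times$ the hardware cost.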

ASC: Adaptive Scale Feature Map Compression for Deep Neural Network

Large language models (LLMs) have shown remarkable capabilities in various
tasks. However, their huge model size and the consequent demand for
computational and memory resources also pose challenges to model deployment.
Currently, 4-bit post-training quantization (PTQ) has achieved some success in
LLMs, reducing the memory footprint by approximately 75% compared to FP16
models, albeit with some accuracy loss. In this paper, we propose SmoothQuant+,
an accurate and efficient 4-bit weight-only PTQ method that requires no
additional training and, for the first time, achieves lossless accuracy for
LLMs. Based on the fact that the loss of weight quantization is amplified by
activation outliers, SmoothQuant+ smooths the activation outliers by channel before
quantization, while adjusting the corresponding weights for mathematical
equivalence, and then performs group-wise 4-bit weight quantization for linear
layers. We have integrated SmoothQuant+ into the vLLM framework, an advanced
high-throughput inference engine specially developed for LLMs, and equipped it
with efficient W4A16 CUDA kernels, so that vLLM can seamlessly support
SmoothQuant+ 4-bit weight quantization. Our results show that, with
SmoothQuant+, the Code Llama-34B model can be quantized and deployed on a single A100
40GB GPU, achieving lossless accuracy and a throughput increase of 1.9 to 4.0
times compared to the FP16 model deployed on two A100 40GB GPUs. Moreover, the
latency per token is only 68% of the FP16 model deployed on two A100 40GB GPUs.
To our knowledge, this is the state of the art in 4-bit weight quantization for LLMs.
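
The two steps named in the abstract, per-channel smoothing with a compensating weight adjustment followed by group-wise 4-bit weight quantization, can be sketched as below. This is a rough pure-Python illustration under stated assumptions, not the SmoothQuant+ implementation or its vLLM kernels: the scale rule `s[j] = max|X[:, j]| ** alpha`, the value `alpha = 0.5`, the symmetric [-8, 7] integer range, and the group size are all choices made for the example, and every function name is hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product, a is [m x k], b is [k x n]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s[j] and multiply weight row j by s[j].

    x is [tokens x n_in], w is [n_in x n_out]. The product x @ w is unchanged.
    The scale rule below is an assumption for illustration only.
    """
    n_in = len(w)
    s = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in x)
        s.append(max(act_max, 1e-8) ** alpha)
    x_s = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_s = [[s[j] * v for v in w[j]] for j in range(n_in)]
    return x_s, w_s, s


def quantize_group(group):
    """Symmetric 4-bit quantization of one group: integers in [-8, 7] and one float scale."""
    scale = max(abs(v) for v in group) / 7 or 1.0  # fall back to 1.0 for an all-zero group
    q = [max(-8, min(7, round(v / scale))) for v in group]
    return q, scale


def quantize_weights(w, group_size=2):
    """Quantize each output column in groups of input channels, then dequantize."""
    n_in, n_out = len(w), len(w[0])
    w_hat = [[0.0] * n_out for _ in range(n_in)]
    for k in range(n_out):
        for start in range(0, n_in, group_size):
            rows = range(start, min(start + group_size, n_in))
            q, scale = quantize_group([w[j][k] for j in rows])
            for j, qv in zip(rows, q):
                w_hat[j][k] = qv * scale
    return w_hat


if __name__ == "__main__":
    x = [[0.5, 40.0, -1.0, 0.2], [-0.3, -55.0, 0.8, 0.1]]  # channel 1 carries outliers
    w = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.25], [0.1, -0.2]]
    x_s, w_s, s = smooth(x, w)
    print("reference output:", matmul(x, w))
    print("smoothed output: ", matmul(x_s, w_s))
    print("W4 output:       ", matmul(x_s, quantize_weights(w_s)))
```

The first two printed products should agree up to floating-point rounding, which is the mathematical equivalence the abstract refers to; the third shows the effect of 4-bit weights on this toy layer and says nothing about accuracy on a real model.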

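This work proposes SmoothQuant+, an accurate and efficient 4-bit weight quantization method that reduces the memory footprint of large language models without loss of accuracy. With SmoothQuant+, the Code Llama-34B model runs on a single A100 40GB GPU with lossless accuracy and, compared to the FP16 model deployed on two A100 40GB GPUs, achieves a throughput increase of 1.9 to 4.0 times with a per-token latency of only 68%. It is described as the state of the art among known 4-bit weight quantization methods for large language models.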

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

Widely popular transformer-based NLP models such as BERT and Turing-NLG have
enormous capacity trending to billions of parameters. Current execution methods
demand brute-force resources such as HBM devices and high-speed
interconnectivity for data parallelism. In this paper, we introduce a new
relay-style execution technique called L2L (layer-to-layer) in which, at any
given moment, device memory is populated primarily with the footprint of the
executing layer(s). The model resides in DRAM attached to either a CPU or an
FPGA, an entity we call the eager param-server (EPS). To overcome the
bandwidth issues of shuttling parameters to and from the EPS, the model is
executed a layer at a time across many micro-batches, instead of the
conventional method of running minibatches over the whole model. L2L is
implemented on 16GB V100 devices for BERT-Large, running with a device batch
size of up to 256. Our results show a 45% reduction in memory and a 40%
increase in throughput compared to the state-of-the-art baseline. L2L can also
fit models of up to 50 billion parameters on a machine with a single 16GB V100
and 512GB of CPU memory, without requiring any model partitioning. L2L scales
to arbitrary depth, allowing researchers to develop on affordable devices,
which is a big step toward democratizing AI. By running the optimizer in the host EPS, we show a
new form of mixed precision for faster throughput and convergence. In addition,
the EPS enables dynamic neural architecture approaches by varying layers across
iterations. Finally, we propose and demonstrate a constant-memory variation
of L2L and outline future enhancements. This work was performed on GPUs
first, but it also targets all high-TFLOPS/Watt accelerators.
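
The scheduling idea, fetching one layer and pushing every micro-batch through it before fetching the next, can be sketched with a toy forward pass. This is a simplified illustration rather than the L2L implementation: layers are plain Python closures, the `host_layers` list stands in for the EPS, the `loads` counter stands in for parameter transfers to the device, and the backward pass and optimizer step are not modeled. All names are hypothetical.

```python
def make_layer(scale, bias):
    """A toy 'layer': an elementwise affine map followed by ReLU."""
    return lambda xs: [max(0.0, scale * x + bias) for x in xs]


def run_model_per_micro_batch(host_layers, micro_batches):
    """Baseline for a device that holds one layer at a time.

    Each micro-batch goes through the whole model, so every layer is
    fetched from the host once per micro-batch.
    """
    loads, outputs = 0, []
    for mb in micro_batches:
        for layer in host_layers:
            loads += 1  # fetch this layer's parameters from the host
            mb = layer(mb)
        outputs.append(mb)
    return outputs, loads


def run_layer_to_layer(host_layers, micro_batches):
    """Fetch one layer, run every micro-batch through it, then move to the next layer."""
    loads = 0
    activations = list(micro_batches)
    for layer in host_layers:
        loads += 1  # only this layer is resident on the device
        activations = [layer(mb) for mb in activations]
    return activations, loads


if __name__ == "__main__":
    layers = [make_layer(1.5, -0.2), make_layer(-0.7, 1.0), make_layer(2.0, 0.1)]
    batches = [[0.1, 0.5], [1.0, -2.0], [0.3, 0.3], [4.0, 0.0]]
    out_a, loads_a = run_model_per_micro_batch(layers, batches)
    out_b, loads_b = run_layer_to_layer(layers, batches)
    print("same outputs:", out_a == out_b)
    print("layer fetches:", loads_a, "vs", loads_b)
```

Both schedules compute the same outputs; the layer-at-a-time order fetches each layer once per set of micro-batches instead of once per micro-batch, at the price of keeping the intermediate activations of all micro-batches around between layers.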

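This study proposes a new execution technique called L2L that can fit models of up to 50 billion parameters on a machine with a single 16GB V100 and 512GB of CPU memory. On BERT-Large with 16GB V100 devices, it reduces memory usage by 45% and increases throughput by 40% compared to the state-of-the-art baseline, a step toward democratizing AI.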