This study evaluates the effectiveness of zero-shot compression techniques on
large language models (LLMs) in long-context settings. We find that, for
certain compression methods, computational errors tend to increase with long
contexts. We propose a hypothesis to explain why different LLM compression
techniques behave differently and explore remedies to mitigate the performance
decline that some techniques exhibit on long contexts. This is a
course report for COS 598D Machine Learning and Systems by Prof. Kai Li at
Princeton University. Due to limited computational resources, our experiments
were conducted only on LLaMA-2-7B-32K.

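Summary: This work evaluates the effectiveness of zero-shot compression techniques on large language models (LLMs) in long-context settings and finds that computational errors tend to increase when certain compression methods are applied. It proposes a hypothesis to explain the differing behavior of LLM compression techniques and explores ways to mitigate the performance decline that some techniques show on long contexts.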

Evaluating Zero-Shot Long-Context LLM Compression

The heavy burden of computation and off-chip traffic impedes the deployment of
large-scale convolutional neural networks (CNNs) on embedded platforms. Because
CNNs tolerate computation errors well, employing block floating point (BFP)
arithmetic in CNN accelerators can reduce hardware cost and data traffic
efficiently while maintaining classification accuracy. In this paper, we
examine how the choice of word width in BFP affects CNN performance without
retraining. Several typical CNN models, including VGG16, ResNet-18, ResNet-50,
and GoogLeNet, were tested. Experiments revealed that an 8-bit mantissa,
including the sign bit, in the BFP representation induced less than 0.3%
accuracy loss. In addition, we investigate the computational errors
theoretically and derive an upper bound on the noise-to-signal ratio (NSR),
which provides useful guidance for the design of BFP-based CNN engines.

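Summary: This paper tests several classic convolutional neural network (CNN) models without retraining, verifies the effect of word-width definitions when block floating point (BFP) arithmetic is used in CNN accelerators, and investigates the computational errors in theory, deriving an upper bound on the noise-to-signal ratio (NSR) that provides valuable guidance for the design of BFP-based CNN engines.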
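
To make the BFP idea in the abstract above concrete, the following is a minimal sketch of quantizing a block of values to a shared exponent and measuring the resulting noise-to-signal ratio. It is an illustration under stated assumptions, not the paper's implementation: the function names `bfp_quantize` and `nsr` are hypothetical, and the rounding mode (round to nearest), saturation on overflow, and the choice of one exponent per input list are assumptions that the abstract does not specify.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a non-empty block of floats so all values share one exponent.

    Assumption: each mantissa is a signed integer of `mantissa_bits` bits
    (sign bit included), rounded to nearest and saturated on overflow.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0 for _ in block]
    # Shared exponent e such that max_abs = m * 2**e with 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    frac_bits = mantissa_bits - 1            # magnitude bits left after the sign
    step = 2.0 ** (shared_exp - frac_bits)   # value of one mantissa LSB
    lo, hi = -(2 ** frac_bits), 2 ** frac_bits - 1
    quantized = []
    for x in block:
        mantissa = max(lo, min(hi, round(x / step)))
        quantized.append(mantissa * step)
    return quantized


def nsr(original, quantized):
    """Noise-to-signal ratio: quantization error power over signal power."""
    noise = sum((a - b) ** 2 for a, b in zip(original, quantized))
    signal = sum(a * a for a in original)
    return noise / signal


if __name__ == "__main__":
    block = [0.75, -0.5, 0.125, 0.0078125, 0.3, -0.041]
    for bits in (4, 6, 8):
        print(bits, nsr(block, bfp_quantize(block, bits)))
```

Because the exponent is set by the largest magnitude in the block, small values lose relative precision as the mantissa narrows, which is why the measured NSR depends on the word width. This sketch measures NSR empirically on one block; it does not reproduce the theoretical upper bound mentioned in the abstract.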