Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, has raised concerns about their effective deployment and created a need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach for assessing compressed LLMs that addresses the limitations of traditional measures such as perplexity, which fail to accurately reflect text generation quality. DTMs focus on token divergence, providing deeper insight into the subtleties of model compression. Our results indicate that significant levels of precision and sparsity can be achieved without compromising text generation quality. Moreover, DTMs offer a more precise evaluation of each component's individual impact. Using the First Divergent Token Metric (FDTM) in model sparsification reveals that nearly 20% of all components can be pruned by more than 90%. In terms of quantization, the FDTM suggests that over 80% of parameters can be converted directly to int8 without special outlier management.
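To make the notion of token divergence concrete, the following is a minimal sketch of how a first-divergent-token index could be computed between the token sequence generated by a reference model and the one generated by its compressed counterpart. The function name `first_divergent_token`, the use of plain token-ID lists, and the handling of sequences of unequal length are assumptions made for illustration; the abstract does not give the exact definition of the FDTM.

```python
from typing import Sequence


def first_divergent_token(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Return the index of the first position at which two token sequences differ.

    Assumption for this sketch: if one sequence is a prefix of the other
    (or both are identical), the length of the shorter sequence is returned.
    """
    for index, (ref_token, cand_token) in enumerate(zip(reference, candidate)):
        if ref_token != cand_token:
            return index
    return min(len(reference), len(candidate))


# Hypothetical token IDs from a reference model and a compressed model.
reference_ids = [101, 7, 42, 13, 99]
compressed_ids = [101, 7, 42, 58, 99]
divergence_index = first_divergent_token(reference_ids, compressed_ids)
```

Under these assumptions, a larger index means the compressed model reproduces the reference generation for longer before deviating, which is the kind of signal that perplexity alone does not expose.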

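By introducing the Divergent Token Metrics (DTMs), this study explores the compression of large language models and evaluates the text generation quality of the compressed models. The results show that significant levels of precision and sparsity can be reached without compromising text generation quality, and that DTMs allow a more precise evaluation of each model component's impact. Applying the First Divergent Token Metric (FDTM) to model sparsification reveals that nearly 20% of all components can be pruned by more than 90%. For quantization, the FDTM suggests that over 80% of parameters can be converted directly to int8 without special outlier management.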

Divergent Token Metrics: Measuring Degradation to Prune LLM Components and Optimize Quantization