Transformer-based pre-trained language models (PLMs) have been widely adopted for natural language processing tasks. However, their heavy overhead leads to high latency and computational cost. Static compression methods allocate the same fixed computation to every sample, which results in redundant computation. Dynamic token pruning methods selectively shorten the input sequences, but they cannot change the model size and rarely achieve the speedups of static pruning. In this paper, we propose a model acceleration approach for large language models that combines dynamic token downsampling with static pruning, optimized by an information bottleneck loss. Our model, Infor-Coef, achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT. This work provides a promising approach to compressing and accelerating Transformer-based models for NLP tasks.
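To make the idea of dynamic token downsampling concrete, the following is a minimal sketch and not the Infor-Coef implementation. It assumes each token already has an importance score in [0, 1]; the function names, the threshold of 0.5, and the example scores are all hypothetical, chosen only for illustration. Because the number of kept tokens depends on the scores, different inputs end up with different sequence lengths, which is what distinguishes dynamic pruning from a static, fixed budget.

```python
from typing import List, Sequence


def downsample_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep the first token plus every token whose score reaches the threshold."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    kept = []
    for index, (token, score) in enumerate(zip(tokens, scores)):
        # The first token (e.g. [CLS]) is always kept so a classifier
        # still has a position to read from; this is an assumption here.
        if index == 0 or score >= threshold:
            kept.append(token)
    return kept


def attention_cost_ratio(original_len: int, kept_len: int) -> float:
    """Ratio of pairwise attention-score computations after downsampling.

    Only the quadratic attention-score term is counted; feed-forward
    layers scale linearly with sequence length and are ignored here.
    """
    return (kept_len ** 2) / (original_len ** 2)


# Hypothetical importance scores; a real system would predict them
# with a learned module rather than hard-code them.
tokens = ["[CLS]", "the", "movie", "was", "great"]
scores = [0.1, 0.2, 0.9, 0.3, 0.8]

kept = downsample_tokens(tokens, scores)
ratio = attention_cost_ratio(len(tokens), len(kept))
print(kept, ratio)
```

In this toy input three of the five tokens survive, so the attention-score computation shrinks to 9/25 of its original size. The model weights themselves are untouched, which is why dynamic pruning alone does not reduce model size.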

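This paper proposes the Infor-Coef model, which combines dynamic downsampling with static pruning for NLP and is optimized with an information bottleneck loss. It achieves an 18x computational speedup with an accuracy drop of less than 8%, offering a promising approach to compressing and accelerating Transformer-based models.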

Infor-Coef: An Information Bottleneck-Based Dynamic Token Downsampling Method for Compact and Efficient Language Models