Transformer-based models, represented by GPT-3, ChatGPT, and GPT-4, have
recently attracted increasing interest, research enthusiasm, and business
demand. However, their massive computational requirements and large memory
footprint pose unavoidable challenges. To tackle this issue, we propose BCT, a framework of
blockwise compression for transformers without retraining, to lower deployment
thresholds. BCT achieves more fine-grained compression of the whole
transformer, including embedding, matrix multiplication, GELU, Softmax, layer
normalization, and all the intermediate results. As a case study, we compress an
efficient model with BCT and evaluate it on several General Language
Understanding Evaluation (GLUE) datasets. The results show that BCT incurs an
accuracy drop of less than 0.90% on most tasks.

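This work proposes BCT, a framework for blockwise compression of transformers, to reduce their heavy computation and memory costs. Evaluated on several GLUE datasets, BCT incurs an accuracy drop of less than 0.90% on most tasks.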

Blockwise Compression of Transformer-based Models without Retraining
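
The BCT abstract above does not spell out its compression scheme, so the following is only a minimal sketch of one common reading of "blockwise" compression: symmetric int8 quantization with a separate scale per block. The function names `quantize_blockwise` and `dequantize_blockwise`, the block size, and the sample weights are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int) -> Tuple[List[List[int]], List[float]]:
    """Quantize a flat list of floats to int8 codes, with one scale per block.

    Assumption: symmetric quantization to [-127, 127]; not the paper's exact scheme.
    """
    codes: List[List[int]] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # An all-zero block gets a dummy scale of 1.0 to avoid division by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        codes.append([max(-127, min(127, round(v / scale))) for v in block])
        scales.append(scale)
    return codes, scales


def dequantize_blockwise(codes: List[List[int]], scales: List[float]) -> List[float]:
    """Map int8 codes back to floats using each block's own scale."""
    return [q * s for block, s in zip(codes, scales) for q in block]


# Hypothetical weights: one block of small values, one block of large values.
weights = [0.02, -0.01, 0.03, 0.015, 4.0, -3.5, 2.0, 1.0]
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block carries its own scale, the small values in the first block are not rounded against the range of the large values in the second block, which is the usual motivation for finer-grained schemes of this kind.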

Knowledge distillation addresses the problem of transferring knowledge from a
teacher model to a student model. In this process, we typically have multiple
types of knowledge extracted from the teacher model. The problem is to make
full use of them to train the student model. Our preliminary study shows that:
(1) not all of the knowledge is necessary for learning a good student model,
and (2) knowledge distillation can benefit from certain knowledge at different
training steps. In response to these findings, we propose an actor-critic
approach to selecting appropriate knowledge to transfer during knowledge
distillation. In addition, we offer a refinement of the training algorithm to
ease the computational burden. Experimental results on the GLUE datasets show
that our method significantly outperforms several strong knowledge distillation
baselines.

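This paper proposes an actor-critic knowledge distillation framework that selects appropriate knowledge from the teacher model to train the student model. Experiments on the GLUE datasets show that it outperforms conventional baselines.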
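
To make the idea of selecting knowledge per training step concrete, here is a minimal sketch of a distillation loss that sums only the knowledge terms chosen for the current step. It does not reproduce the actor-critic selector from the abstract above: the selection is passed in by hand, and the knowledge-type names (`soft_labels`, `hidden_states`), the temperature, and the sample values are all assumptions for illustration.

```python
import math
from typing import Dict, List, Set


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits: List[float], teacher_logits: List[float],
                    temperature: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error, used here as a stand-in hidden-state matching loss."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(losses: Dict[str, float], selected: Set[str]) -> float:
    """Sum only the knowledge losses selected for this training step."""
    return sum(value for name, value in losses.items() if name in selected)


# Hypothetical teacher and student outputs for a single example.
losses = {
    "soft_labels": soft_label_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0]),
    "hidden_states": mse([0.1, 0.4, -0.3], [0.2, 0.5, -0.1]),
}
# Assumption: a selector (actor-critic in the paper) chose only soft labels here.
step_loss = distillation_loss(losses, selected={"soft_labels"})
```

In a full system, the set passed as `selected` would come from a learned policy and could change from step to step; here it is fixed only to show how the selection enters the loss.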