In recent years, large pre-trained Transformer networks have demonstrated
dramatic improvements in many natural language understanding tasks. However,
the huge size of these models brings significant challenges to their
fine-tuning and online deployment due to latency and cost constraints. New
hardware supporting both N:M semi-structured sparsity and low-precision integer
computation is a promising solution to boost DNN model serving efficiency.
However, there have been very few studies that systematically investigate to
what extent pre-trained Transformer networks benefit from the combination of
these techniques, as well as how to best compress each component of the
Transformer. We propose a flexible compression framework, NxMiFormer, that
performs simultaneous sparsification and quantization using ADMM and STE-based
QAT. Furthermore, we present an inexpensive, heuristic-driven search algorithm
that identifies promising heterogeneous compression configurations that meet a
compression ratio constraint. When evaluated across the GLUE suite of NLU
benchmarks, our approach can achieve up to 93% compression of the encoders of a
BERT model while retaining 98.2% of the original model accuracy and taking full
advantage of the hardware's capabilities. Heterogeneous configurations found
by the search heuristic maintain 99.5% of the baseline accuracy while still
compressing the model by 87.5%.
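
To make the two compression primitives named in the abstract concrete, the following is a minimal sketch, not the NxMiFormer implementation. It shows magnitude-based N:M pruning (keep the N largest-magnitude weights in every group of M), symmetric uniform "fake" quantization as used in the forward pass of quantization-aware training, and a simple storage estimate. The function names, the FP32 baseline, and the choice to ignore sparsity-index metadata are all assumptions made for illustration; the ADMM optimization and the search heuristic are not shown.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def fake_quantize(weights, bits=8):
    """Symmetric uniform quantization to signed integers, then dequantization.

    In STE-based training this function is the forward pass; the backward
    pass would treat it as the identity (not modeled in this sketch).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def compression(n, m, bits, baseline_bits=32):
    """Fraction of weight storage removed (assumed FP32 baseline, no metadata)."""
    return 1.0 - (n / m) * (bits / baseline_bits)


if __name__ == "__main__":
    w = [0.1, -0.9, 0.5, 0.05, 1.0, -0.2, 0.3, -0.4]
    sparse = nm_prune(w, n=2, m=4)
    print(sparse)
    print(fake_quantize(sparse, bits=4))
    print(compression(2, 4, 8), compression(2, 4, 4))
```

Under these assumptions, 2:4 sparsity with 8-bit weights removes 87.5% of the storage and 2:4 sparsity with 4-bit weights removes 93.75%. These values are close to the compression levels quoted above, but the mapping is an inference; the abstract does not state which configurations produce its figures.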

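This paper proposes a new framework, NxMiFormer, that performs sparsification and quantization simultaneously using ADMM and STE-based QAT, and uses a search algorithm to find promising heterogeneous compression configurations. On NLU benchmarks, pre-trained Transformer networks reach up to 93% compression while retaining over 98% of the original accuracy.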

Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding

Fine-tuning of pre-trained transformer networks such as BERT yields
state-of-the-art results for text classification tasks. Typically, fine-tuning
is performed on task-specific training datasets in a supervised manner. One can
also fine-tune in an unsupervised manner beforehand by further pre-training on
the masked language modeling (MLM) task. Here, in-domain data for unsupervised
MLM that resembles the actual classification target dataset allows for domain
adaptation of the model. In this paper, we compare current pre-trained
transformer networks with and without MLM fine-tuning on their performance for
offensive language detection. Our MLM fine-tuned RoBERTa-based classifier
officially ranks 1st in the SemEval 2020 Shared Task 12 for the English
language. Further experiments with the ALBERT model even surpass this result.
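
As an illustration of the unsupervised MLM objective mentioned above, the following is a minimal sketch of how training inputs could be corrupted. It assumes the masking scheme introduced with BERT (select roughly 15% of tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged). The abstract does not state the exact procedure used, and `mask_tokens`, the toy vocabulary, and the whitespace tokenization are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one MLM training example.

    labels[i] holds the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: left unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


if __name__ == "__main__":
    vocab = ["you", "are", "such", "a", "kind", "person", "today"]
    tokens = "you are such a kind person".split()
    print(mask_tokens(tokens, vocab, mask_prob=0.3, seed=0))
```

For domain adaptation, the token sequences would come from unlabeled in-domain text that resembles the classification dataset, and the supervised fine-tuning on labels would follow as a separate step.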

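This paper fine-tunes pre-trained transformer networks with the unsupervised MLM task to improve their performance on offensive language detection, and reports strong results.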