As pretrained transformer language models continue to achieve
state-of-the-art performance, the Natural Language Processing community has
pushed for advances in model compression and efficient attention mechanisms to
address high computational requirements and limited input sequence length.
Despite these separate efforts, the intersection of the two fields has not
been investigated. In this work, we evaluate model compression via knowledge
distillation on efficient attention transformers. We report cost-performance
trade-offs for compressing state-of-the-art efficient attention architectures,
along with their performance gains over full attention counterparts. Furthermore, we
introduce a new long-context Named Entity Recognition dataset, GONERD, to train
and test the performance of NER models on long sequences. We find that
distilled efficient attention transformers preserve a significant share of
original model performance: up to 98.6% across short-context tasks
(GLUE, SQuAD, CoNLL-2003), up to 94.6% across long-context
question answering tasks (HotpotQA, TriviaQA), and up to 98.8% on
long-context Named Entity Recognition (GONERD), while decreasing inference
times by up to 57.8%. For most models on most tasks, knowledge distillation
is an effective way to obtain high-performing efficient attention models at
low cost.
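
The abstract does not say which distillation objective was used. As a minimal sketch, assuming the common soft-target formulation (a temperature-scaled KL term between teacher and student distributions, mixed with a hard-label cross-entropy term), the loss for a single example could look like the following. The function names, the temperature of 2.0, and the mixing weight `alpha` are illustrative assumptions, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss for one example (assumed formulation).

    alpha weights the soft (teacher-matching) term; 1 - alpha weights the
    ordinary cross-entropy against the gold label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    soft_term = (temperature ** 2) * kl
    # Hard-label cross-entropy is computed at temperature 1.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * soft_term + (1 - alpha) * hard_term


# Toy logits, invented for illustration.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
loss = distillation_loss(student, teacher, label=0)
print(f"distillation loss: {loss:.4f}")
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the hard-label term remains, which makes for a convenient sanity check on an implementation of this kind.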

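An evaluation of model compression via knowledge distillation on efficient attention transformers; using the new long-context Named Entity Recognition dataset GONERD, it confirms that distilled efficient attention transformers reduce inference time while preserving original model performance.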

Efficient Transformer Knowledge Distillation: A Performance Review

Transformer-based language models have been changing the modern Natural
Language Processing (NLP) landscape for high-resource languages such as
English, Chinese, and Russian. However, this technology does not yet exist for
any Ghanaian language. In this paper, we introduce the first such models for
Twi or Akan, the most widely spoken Ghanaian language. The specific
contribution of this research work is the development of several pretrained
transformer language models for the Akuapem and Asante dialects of Twi, paving
the way for advances in application areas such as Named Entity Recognition
(NER), Neural Machine Translation (NMT), Sentiment Analysis (SA) and
Part-of-Speech (POS) tagging. Specifically, we introduce four different
flavours of ABENA (A BERT model Now in Akan), which is fine-tuned on a set of
Akan corpora, and BAKO (BERT with Akan Knowledge Only), which is trained from
scratch. We open-source the models through the Hugging Face model hub and
demonstrate their use via a simple sentiment classification example.

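This study presents the first pretrained transformer language models for Twi (Akan), paving the way for advances in application areas such as Named Entity Recognition, Neural Machine Translation, Sentiment Analysis, and Part-of-Speech tagging. It introduces four different flavours of BERT model, ABENA and BAKO, pretrained for the Akuapem and Asante dialects of Twi, open-sources them through the Hugging Face model hub, and demonstrates their use via a simple sentiment classification example.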

Contextual Text Embeddings for Twi

The success of pretrained transformer language models (LMs) in natural
language processing has led to a wide range of pretraining setups. In
particular, these models employ a variety of subword tokenization methods, most
notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the
WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling
(Kudo, 2018), to segment text. However, to the best of our knowledge, the
literature does not contain a direct evaluation of the impact of tokenization
on language model pretraining. We analyze differences between BPE and unigram
LM tokenization, finding that the latter method recovers subword units that
align more closely with morphology and avoids problems stemming from BPE's
greedy construction procedure. We then compare the fine-tuned task performance
of identical transformer masked language models pretrained with these
tokenizations. Across downstream tasks and two languages (English and
Japanese), we find that the unigram LM tokenization method matches or
outperforms BPE. We hope that developers of future pretrained LMs will consider
adopting the unigram LM method over the more prevalent BPE.
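
To make the contrast between the two segmentation procedures concrete, here is a small sketch: BPE applies its learned merges greedily in the order they were learned, while a unigram LM picks the highest-probability segmentation with a Viterbi search. The merge list and piece probabilities below are invented toy values chosen to expose the difference. They are not taken from the paper, and the output is an illustration of the mechanism rather than evidence for the paper's findings.

```python
import math


def bpe_segment(word, merges):
    """Apply an ordered list of merges greedily, earliest-learned merge first."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if (i + 1 < len(symbols)
                    and symbols[i] == left and symbols[i + 1] == right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def unigram_segment(word, log_probs):
    """Viterbi search for the most probable segmentation under a unigram LM."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")
    pieces = []
    end = n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical merge order and piece scores (not normalized, toy values only).
toy_merges = [("w", "e"), ("l", "o"), ("s", "t")]
toy_probs = {"low": 0.2, "est": 0.2, "lo": 0.1, "we": 0.1, "st": 0.1,
             "l": 0.01, "o": 0.01, "w": 0.01, "e": 0.01, "s": 0.01, "t": 0.01}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

print(bpe_segment("lowest", toy_merges))         # ['lo', 'we', 'st']
print(unigram_segment("lowest", toy_log_probs))  # ['low', 'est']
```

In this toy setup the early `("w", "e")` merge locks BPE into a split that crosses the stem and suffix boundary, whereas the Viterbi search is free to compare whole segmentations and settles on `low` + `est`.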

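Analyzes how different subword segmentation methods, such as BPE and unigram LM, affect the pretraining of Transformer language models and compares their downstream results; the unigram method matches or outperforms BPE on task performance, and developers are encouraged to adopt it for pretraining.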