The Transformer architecture is superior to RNN-based models in computational
efficiency. Recently, GPT and BERT demonstrate the efficacy of Transformer
models on various NLP tasks using pre-trained language models on large-scale
corpora. Surprisingly, these Transformer architectures are suboptimal for
language model itself. Neither self-attention nor the positional encoding in
the Transformer is able to efficiently incorporate the word-level sequential
context crucial to language modeling.
In this paper, we explore effective Transformer architectures for language
model, including adding additional LSTM layers to better capture the sequential
context while still keeping the computation efficient. We propose Coordinate
Architecture Search (CAS) to find an effective architecture through iterative
refinement of the model. Experimental results on the PTB, WikiText-2, and
WikiText-103 show that CAS achieves perplexities between 20.42 and 34.11 on all
problems, i.e. on average an improvement of 12.0 perplexity units compared to
state-of-the-art LSTMs. The source code is publicly available.

本篇论文针对 Transformer 架构不足以高效融合语言建模所需的单词级序列上下文，提出了在保持计算效率的同时通过添加额外的 LSTM 层能够更好地捕捉顺序上下文的有效 Transformer 架构，其中 Coordinate Architecture Search（CAS）通过迭代模型的精炼来找到一个有效的架构，实验结果表明 CAS 在所有问题上的 perplexities 达到了 20.42 ~ 34.11，即比最先进的 LSTM 提高了 12.0 perplexity 单位。

基于 Transformer 的语言模型

Language Models with Transformers

Selecting optimal parameters for a neural network architecture can often make
the difference between mediocre and state-of-the-art performance. However,
little is published which parameters and design choices should be evaluated or
selected making the correct hyperparameter optimization often a "black art that
requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate
the importance of different network design choices and hyperparameters for five
common linguistic sequence tagging tasks (POS, Chunking, NER, Entity
Recognition, and Event Detection). We evaluated over 50.000 different setups
and found, that some parameters, like the pre-trained word embeddings or the
last layer of the network, have a large impact on the performance, while other
parameters, for example the number of LSTM layers or the number of recurrent
units, are of minor importance. We give a recommendation on a configuration
that performs well among different tasks.

通过评估超过 50,000 种不同的设置，我们发现网络设计选择和超参数对五个常见的语言序列标记任务，即 POS、块状、NER、实体识别和事件检测有显着影响，尤其是预先训练的词嵌入或者网路的最后一层。对于 LSTM 层数或循环单元的数量等其他参数相对不太重要。我们建议一种配置，可以在不同任务之间表现优异。