The Transformer architecture has profoundly changed natural language processing,
outperforming all previous state-of-the-art models. However, well-known
Transformer models like BERT, RoBERTa, and GPT-2 require a huge compute budget
to create a high-quality contextualised representation. In this paper, we study
several efficient pre-training objectives for Transformer-based models. By
testing these objectives on different tasks, we determine which of the ELECTRA
model's new features is the most relevant. We confirm that Transformer
pre-training is improved when the input does not contain masked tokens and that
using the whole output to compute the loss reduces training time.
Moreover, inspired by ELECTRA, we study a model composed of two blocks: a
discriminator and a simple generator based on a statistical model that has no
impact on computational performance. In addition, we find that eliminating
the MASK token and considering the whole output during the loss computation are
essential choices to improve performance. Furthermore, we show that it is
possible to efficiently train BERT-like models using a discriminative approach
as in ELECTRA but without its complex and expensive generator. Finally, we
show that ELECTRA benefits heavily from a state-of-the-art hyper-parameter
search.
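
To make the two-block idea above more concrete, the following is a minimal
sketch of replaced-token detection with a statistical generator. It is an
illustration under stated assumptions, not the implementation used in this
work: the abstract only says the generator is "based on a statistical model",
so the unigram-frequency sampler, the helper names (build_unigram_sampler,
corrupt, detection_loss), the toy corpus, and the replacement probability are
all hypothetical choices. The sketch shows the two properties discussed above:
the corrupted input never contains a MASK token, and the loss is averaged over
every output position.

```python
import math
import random
from collections import Counter


def build_unigram_sampler(corpus, rng):
    """Return a function that samples tokens proportionally to corpus frequency.

    Assumption: a unigram model stands in for the "statistical" generator.
    """
    counts = Counter(tok for sent in corpus for tok in sent)
    vocab = list(counts)
    weights = [counts[t] for t in vocab]
    return lambda: rng.choices(vocab, weights=weights, k=1)[0]


def corrupt(tokens, sample_token, rng, replace_prob=0.15):
    """Replace some tokens with sampled ones; no MASK token is ever inserted.

    Returns the corrupted sequence and per-position labels
    (1 = token differs from the original, 0 = token is unchanged).
    The default replace_prob is an illustrative assumption.
    """
    corrupted, labels = [], []
    for tok in tokens:
        new_tok = sample_token() if rng.random() < replace_prob else tok
        corrupted.append(new_tok)
        labels.append(int(new_tok != tok))
    return corrupted, labels


def detection_loss(probs, labels):
    """Mean binary cross-entropy over every position, not only replaced ones."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


# Toy corpus, purely illustrative.
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
rng = random.Random(0)
sample_token = build_unigram_sampler(corpus, rng)
original = corpus[0]
corrupted, labels = corrupt(original, sample_token, rng, replace_prob=0.5)

# A real discriminator would predict these probabilities; 0.5 is a placeholder.
loss = detection_loss([0.5] * len(labels), labels)
```

Because sampling from a frequency table costs almost nothing compared with a
forward pass through a neural generator, a setup of this kind would add
essentially no overhead to the discriminator's training step.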

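This paper studies efficient pre-training objectives for Transformer models and examines several new features of the ELECTRA model. The results show that removing the mask token and computing the loss over the whole output help improve model performance, that a discriminative approach modeled on ELECTRA allows BERT-like models to be trained more efficiently, and that these methods are further improved by hyper-parameter search.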