Mar, 2023
Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal
TL;DR
This paper explores the effects of small-scale training on masked language models, using the British National Corpus as the training source for pre-training and evaluation. It proposes an optimized LTG-BERT architecture, offering a new direction for the development of masked language models.
Abstract
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus.
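
The abstract refers to the standard masked-language-model pre-training objective. As a minimal sketch of what that objective looks like on plain text, the snippet below masks a fraction of tokens and keeps the originals as prediction targets. It assumes the conventional BERT-style 15% masking rate and whitespace tokenization for illustration only; the paper's LTG-BERT tokenizer, masking schedule, and architecture details are not reproduced here.

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # conventional BERT-style rate; an assumption, not taken from the abstract

def mask_tokens(tokens, mask_prob=MASK_PROB):
    """Return (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere (positions that are not scored)."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)   # unmasked position, ignored by the loss
    return masked, labels

if __name__ == "__main__":
    random.seed(0)
    sentence = "the british national corpus is a balanced sample of english".split()
    masked, labels = mask_tokens(sentence)
    print(masked)
    print(labels)
```

In practice the masked positions are scored with a cross-entropy loss over the model's vocabulary; this sketch only illustrates how the training signal is constructed from raw corpus text.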