In this paper, we introduce CLUECorpus2020, a Chinese corpus from the CLUE organization. It is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl. To better understand this corpus, we conduct language understanding experiments at both small and large scale, and the results show that models trained on this corpus achieve excellent performance on Chinese. We release a new Chinese vocabulary with a size of 8K, only one-third of the vocabulary size used in the Chinese BERT released by Google. It saves computational cost and memory while performing as well as the original vocabulary. We also release both large and tiny versions of the model pre-trained on this corpus. The former achieves state-of-the-art results, and the latter retains most of the precision while speeding up training and prediction by a factor of eight compared to BERT-base. To facilitate future work on self-supervised learning for Chinese, we release our dataset, new vocabulary, code, and pre-trained models on GitHub.
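
To make the memory claim about the smaller vocabulary concrete, the following minimal sketch counts the weights in a token-embedding matrix for the two vocabularies. The numbers are assumptions, not figures from this abstract: a hidden width of 768 (the usual BERT-base setting), 21,128 entries for Google's Chinese BERT vocabulary, and a round 8,000 for the "8K" vocabulary. The helper `embedding_params` is a hypothetical name used only for illustration.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in an embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size


# Assumed values (not stated in the abstract):
HIDDEN_SIZE = 768                # typical BERT-base hidden width
GOOGLE_CHINESE_VOCAB = 21_128    # assumed size of Google's Chinese BERT vocabulary
CLUE_VOCAB = 8_000               # "8K" taken as a round 8,000

old_params = embedding_params(GOOGLE_CHINESE_VOCAB, HIDDEN_SIZE)
new_params = embedding_params(CLUE_VOCAB, HIDDEN_SIZE)
saved = old_params - new_params

print(f"original vocabulary embedding weights: {old_params:,}")
print(f"8K vocabulary embedding weights:       {new_params:,}")
print(f"weights saved:                         {saved:,} ({saved / old_params:.1%})")
```

Under these assumptions the embedding table shrinks by a little over 60%. This covers only the token-embedding matrix; the rest of the network is unaffected by the vocabulary size.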

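This paper introduces CLUECorpus2020, a Chinese corpus from the CLUE organization. It is a large-scale corpus that can be used directly for self-supervised learning. It contains 100 GB of raw text with 35 billion Chinese characters and can be used for language generation and for pre-training language models. The paper conducts both small-scale and large-scale language understanding experiments, and the results show that models trained on this corpus achieve excellent performance on Chinese. The authors also release a new Chinese vocabulary and pre-trained models (large and small versions), and publish their code and dataset on GitHub for the community to use.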

CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model