Transformer-based language models, more specifically BERT-based architectures,
have achieved state-of-the-art performance in many downstream tasks. However,
for a relatively low-resource language such as Thai, the choices of models are
limited to training a BERT-based model based on a much smaller dataset or
finetuning multi-lingual models, both of which yield suboptimal downstream
performance. Moreover, large-scale multi-lingual pretraining does not take into
account language-specific features for Thai. To overcome these limitations, we
pretrain a language model based on RoBERTa-base architecture on a large,
deduplicated, cleaned training set (78.5GB in total size), curated from diverse
domains of social media posts, news articles and other publicly available
datasets. We apply text processing rules that are specific to Thai, most
importantly preserving spaces, which mark chunk and sentence boundaries in
Thai, before subword tokenization. We also experiment with
word-level, syllable-level and SentencePiece tokenization on a smaller
dataset to explore the effects of tokenization on downstream performance. Our
model, wangchanberta-base-att-spm-uncased, trained on the 78.5GB dataset,
outperforms strong baselines (NBSVM, CRF and ULMFiT) and multi-lingual models
(XLMR and mBERT) on both sequence classification and token classification tasks
in human-annotated, mono-lingual contexts.

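For Thai, a relatively low-resource language, we pretrain a RoBERTa-base architecture on a large, deduplicated, cleaned training set and study how different tokenization schemes affect downstream performance. In human-annotated, mono-lingual contexts, our model wangchanberta-base-att-spm-uncased outperforms strong baselines and multi-lingual models on sequence classification and token classification tasks.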
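
The space-preservation step described above can be illustrated with a minimal sketch. It assumes that each run of whitespace is replaced by an explicit marker before subword tokenization, so that the tokenizer cannot silently drop the boundary. The marker string `<_>` and the function names `preserve_spaces` and `restore_spaces` are hypothetical choices made for this illustration; the text above does not state the exact rule used.

```python
import re

# Hypothetical marker for a preserved space; chosen for illustration only.
SPACE_TOKEN = "<_>"


def preserve_spaces(text: str, space_token: str = SPACE_TOKEN) -> str:
    """Replace every run of whitespace with one explicit marker.

    Thai does not put spaces between words, so a space usually signals a
    chunk or sentence boundary. Making it an explicit symbol keeps that
    signal visible to a downstream subword tokenizer.
    """
    # Trim the ends, then collapse each whitespace run into a single space.
    collapsed = re.sub(r"\s+", " ", text.strip())
    # Swap the remaining single spaces for the explicit marker.
    return collapsed.replace(" ", space_token)


def restore_spaces(text: str, space_token: str = SPACE_TOKEN) -> str:
    """Inverse of preserve_spaces for already-normalized text."""
    return text.replace(space_token, " ")


if __name__ == "__main__":
    sample = "ฉันไป  ตลาด วันนี้"
    marked = preserve_spaces(sample)
    print(marked)
    print(restore_spaces(marked))
```

The marked string would then be passed to whichever subword tokenizer is in use. Restoring spaces recovers the whitespace-normalized input, not necessarily the original spacing, because runs of whitespace are collapsed first.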