How does the input segmentation of pretrained language models (PLMs) affect their generalization capabilities? We present the first study investigating this question, taking BERT as the example PLM and focusing on the semantic representations of derivationally complex words. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which derivational segmentation consistently outperforms BERT's WordPiece segmentation by a large margin. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

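Taking BERT as an example, this study investigates how the input segmentation of pretrained language models (PLMs) affects their semantic representations of complex words. It shows that PLMs can be interpreted as serial dual-route models, which implies that maximally meaningful input tokens should allow for the best generalization on new words. Across a series of semantic probing tasks, DelBERT, which uses derivational input segmentation, significantly outperforms BERT with WordPiece segmentation. Input tokens that are less fragmented by subword splitting may therefore improve the generalization performance of PLMs.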
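
To make the contrast between the two segmentations concrete, here is a minimal sketch in Python. It is not the authors' code and does not use BERT's real vocabulary: `TOY_VOCAB`, `PREFIXES`, `SUFFIXES`, `STEMS`, and both function names are hypothetical, chosen only for illustration. The first function applies greedy longest-match-first subword splitting in the style of WordPiece; the second strips at most one known prefix and one known suffix around a known stem, which is a simplification of derivational segmentation.

```python
# Hypothetical toy data for illustration only (not BERT's real vocabulary).
TOY_VOCAB = {"super", "superb", "bizarre", "##iza", "##rre"}
PREFIXES = ("super", "un", "re")
SUFFIXES = ("able", "ness", "ize")
STEMS = {"bizarre", "read", "kind"}


def wordpiece_segment(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first splitting; non-initial pieces carry '##'."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry covers this position: the word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


def derivational_segment(word, prefixes, suffixes, stems):
    """Split into (prefix, stem, suffix) if the stem is known, else [word].

    Simplifying assumption: at most one prefix and one suffix per word.
    """
    for prefix in ("",) + tuple(prefixes):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + tuple(suffixes):
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)] if suffix else rest
            if stem in stems:
                return [part for part in (prefix, stem, suffix) if part]
    return [word]


if __name__ == "__main__":
    print(wordpiece_segment("superbizarre", TOY_VOCAB))
    print(derivational_segment("superbizarre", PREFIXES, SUFFIXES, STEMS))
```

With this toy vocabulary, the greedy splitter consumes the longest initial match `superb` and leaves fragments that carry no morphological meaning, whereas the derivational splitter recovers the prefix `super` and the stem `bizarre`. This mirrors the intuition behind the dual-route interpretation: when a complex word is not stored as a whole, its meaning has to be computed from the subwords, so meaningful subwords should help.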

Derivational Morphology Improves BERT's Interpretation of Complex Words: Superbizarre Is Not Superb