Pretraining deep contextualized representations using an unsupervised language modeling objective has led to large performance gains on a variety of NLP tasks. Notwithstanding this enormous success, recent work by Schick and Sch\"utze (2019) suggests that these architectures struggle