Contextualized word embeddings have recently outperformed static word embeddings on many NLP tasks. However, little is known about the internal representations that BERT produces. Do they exhibit common patterns? How does word sense relate to context? We find that nearly all the contextualized word vectors of BERT and