In the era of high performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers in achieving top performances in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) by empirical experiments using linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e. Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualisation, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs, and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.

本研究解决了预训练语言模型中多义词的上下文表示能力不足的问题。通过实证实验，我们发现不同子层对于多义词的上下文表示有显著影响，尤其是在句子中词的位置和上下文长度方面。主要发现表明，在短上下文窗口中，BERT的上层子层对于特定位置的词拥有较高的上下文能力，但这种能力不能广泛推广至其他位置和上下文大小。

探究预训练语言模型子层中多义词的上下文定位