Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks. Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it. In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted. To enable intrinsic probing, we propose a novel framework based on a decomposable multivariate Gaussian probe that allows us to determine whether the linguistic information in word embeddings is dispersed or focal. We then probe fastText and BERT for various morphosyntactic attributes across 36 languages. We find that most attributes are reliably encoded by only a few neurons, with fastText concentrating its linguistic structure more than BERT.

本文讨论了自然语言处理系统中之前探测语言结构方法的缺陷，并提出了基于多元高斯探针的内在探测框架，以便于检测词向量的语言信息。通过36种语言的实验证明，多数形态语法特征由少数神经元可靠编码，而fastText相较于BERT更加集中其语言结构。

基于维度选择的内在探测