Transformer-based language models achieve high performance on various tasks, but we still lack an understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon that requires contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks that target fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses, especially on semantic knowledge, which strongly impact the models' performance. Our results highlight the importance of (a) model comparison in evaluation tasks and (b) building up claims about model performance and the linguistic knowledge models capture beyond purely probing-based evaluations.

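Using sentence-level probing, diagnostic cases, and masked prediction tasks, we test the grammatical and semantic knowledge of three models (BERT, RoBERTa, and ALBERT) on relative clauses. On a naturalistic dataset, probing shows that all three models do capture linguistic knowledge about grammaticality. Evaluation on diagnostic cases and masked prediction tasks that target fine-grained linguistic knowledge, including semantic knowledge, however, reveals pronounced model-specific weaknesses that strongly affect model performance. Our results therefore highlight the importance of comparing models across evaluation tasks and of building claims about model performance and the linguistic knowledge models capture on more than purely probing-based evaluation.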

Exploring Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English