Language models (LMs) have proven to be powerful tools for psycholinguistic
research, but most prior work has focused on purely behavioural measures (e.g.,
surprisal comparisons). At the same time, research in model interpretability
has begun to illuminate the abstract causal mechanisms shaping LM behaviour. To
help bring these strands of research closer together, we introduce CausalGym.
We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of
interpretability methods to causally affect model behaviour. To illustrate how
CausalGym can be used, we study the pythia models (14M--6.9B) and assess the
causal efficacy of a wide range of interpretability methods, including linear
probing and distributed alignment search (DAS). We find that DAS outperforms
the other methods, and so we use it to study the learning trajectory of two
difficult linguistic phenomena in pythia-1b: negative polarity item licensing
and filler--gap dependencies. Our analysis shows that the mechanism
implementing both of these tasks is learned in discrete stages, not gradually.
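
As a rough illustration of the kind of intervention that DAS evaluates, the sketch below performs a one-dimensional interchange intervention: it replaces the component of a base hidden state along a chosen direction with the corresponding component from a source hidden state. This is a minimal sketch under stated assumptions, not CausalGym's implementation. The function name `interchange_intervention` and the toy vectors are hypothetical, and the direction is fixed by hand here, whereas DAS learns it by optimisation.

```python
import math


def interchange_intervention(h_base, h_source, direction):
    """Swap the component of h_base along `direction` for that of h_source.

    All arguments are plain lists of floats of equal length. `direction`
    is a hypothetical, hand-picked vector; DAS would learn it instead.
    """
    # Normalise the direction so that projections are scalar coordinates.
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    # Scalar coordinate of each hidden state along the direction.
    proj_base = sum(b * u for b, u in zip(h_base, unit))
    proj_source = sum(s * u for s, u in zip(h_source, unit))

    # Move h_base along the direction until its coordinate matches the
    # source; everything orthogonal to the direction is left untouched.
    return [b + (proj_source - proj_base) * u for b, u in zip(h_base, unit)]


# Toy usage: only the first coordinate is taken from the source state.
patched = interchange_intervention([1.0, 2.0], [5.0, -3.0], [2.0, 0.0])
```

In a real experiment the patched hidden state would be written back into the model's forward pass, and the causal effect would be read off from the change in the model's output probabilities.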

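Language models play an important role in psycholinguistic research. This work
proposes the CausalGym framework, which studies language model behaviour by
evaluating the causal efficacy of a range of interpretability methods, and
finds that DAS outperforms the other methods. Building on this, it uses the
pythia models to study two difficult linguistic phenomena, negative polarity
item licensing and filler--gap dependencies, and the analysis shows that the
mechanisms implementing both tasks are learned in discrete stages rather than
gradually.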

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

Natural language is an appealing medium for explaining how large language
models process and store information, but evaluating the faithfulness of such
explanations is challenging. To help address this, we develop two modes of
evaluation for natural language explanations that claim individual neurons
represent a concept in a text input. In the observational mode, we evaluate
claims that a neuron $a$ activates on all and only input strings that refer to
a concept picked out by the proposed explanation $E$. In the intervention mode,
we construe $E$ as a claim that the neuron $a$ is a causal mediator of the
concept denoted by $E$. We apply our framework to the GPT-4-generated
explanations of GPT-2 XL neurons of Bills et al. (2023) and show that even the
most confident explanations have high error rates and little to no causal
efficacy. We close the paper by critically assessing whether natural language
is a good choice for explanations and whether neurons are the best level of
analysis.
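
The observational mode can be read as checking an "all and only" claim: the neuron should fire on every string that refers to the concept and on no others. The sketch below is one possible way to score such a claim and is not the paper's evaluation code. It assumes a hypothetical activation threshold and hand-labelled inputs, and the function name `observational_errors` is illustrative only.

```python
def observational_errors(activations, refers_to_concept, threshold=0.0):
    """Score the claim that a neuron fires on all and only concept inputs.

    activations: list of floats, one neuron activation per input string.
    refers_to_concept: list of bools, True if the input refers to the
        concept named by the explanation (assumed to be labelled by hand).
    threshold: hypothetical cut-off above which the neuron counts as firing.
    """
    tp = fp = fn = 0
    for act, label in zip(activations, refers_to_concept):
        fired = act > threshold
        if fired and label:
            tp += 1
        elif fired and not label:
            fp += 1  # violates "only": fires on a non-concept input
        elif label:
            fn += 1  # violates "all": silent on a concept input

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall,
            "false_positives": fp, "false_negatives": fn}


# Toy usage with made-up activations and labels.
scores = observational_errors([0.9, 0.0, 0.4, 0.0],
                              [True, True, False, False])
```

Low precision means the explanation fails the "only" half of the claim, and low recall means it fails the "all" half.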

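Natural language is an appealing medium for explaining how large language
models process and store information, but evaluating the faithfulness of such
explanations is challenging. We develop two modes of evaluation for natural
language explanations, to assess the faithfulness of explanations claiming
that individual neurons represent a concept in a text input. We apply this
framework to the GPT-4-generated explanations of GPT-2 XL neurons from Bills
et al. (2023) and show that even the most confident explanations have high
error rates and little to no causal efficacy. Finally, we critically assess
whether natural language is a good choice for explanations and whether neurons
are the best level of analysis.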