We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models