We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for
explaining language model behaviors. Circuits identified in prior work consist
of polysemantic and difficult-to-interpret units like attention heads or
neurons, rendering them unsuitable for many downstream applications. In
contrast, sparse feature circuits enable detailed understanding of
unanticipated mechanisms. Because they are based on fine-grained units, sparse
feature circuits are also useful for downstream tasks: we introduce SHIFT, a
method that improves the generalization of a classifier by ablating features
that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised
and scalable interpretability pipeline by discovering thousands of sparse
feature circuits for automatically discovered model behaviors.

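We introduce a method for discovering and applying sparse feature circuits,
which are causally relevant subnetworks of human-interpretable features used to
explain language model behaviors. In contrast to the circuits in prior work,
sparse feature circuits are based on fine-grained units, provide a detailed
understanding of unanticipated mechanisms, and are useful for downstream tasks.
We introduce SHIFT, which improves the generalization of a classifier by
ablating features that a human judges to be task-irrelevant. Finally, we
demonstrate an entirely unsupervised and scalable interpretability pipeline for
discovering thousands of sparse feature circuits for automatically discovered
model behaviors.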