Oct, 2023

Attribution Patching Outperforms Automated Circuit Discovery

Aaquib Syed, Can Rager, Arthur Conmy

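TL;DR: By applying a simple method based on attribution patching to prune the least important edges in a neural network, our approach achieves a higher AUC on circuit recovery than existing methods.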
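To make the idea concrete, here is a minimal sketch on a toy model, not the authors' implementation. It assumes that attribution patching scores each edge with a first-order approximation of activation patching: the difference between the corrupted and clean edge activation, multiplied by the gradient of the metric at the clean activation. The toy `metric`, the activation values, and all function names are hypothetical and chosen only for illustration.

```python
import math


def metric(edge_acts):
    """Toy differentiable metric: tanh of the summed edge activations."""
    return math.tanh(sum(edge_acts))


def metric_grad(edge_acts):
    """Analytic gradient of the toy metric with respect to each edge activation."""
    g = 1.0 - math.tanh(sum(edge_acts)) ** 2
    return [g for _ in edge_acts]


def attribution_scores(clean, corrupt):
    """First-order estimate of the effect of patching each edge:
    (corrupt - clean) * d(metric)/d(activation), evaluated on the clean run."""
    grads = metric_grad(clean)
    return [(c - x) * g for x, c, g in zip(clean, corrupt, grads)]


def activation_patching_effects(clean, corrupt):
    """Exact effect of patching one edge at a time (one metric evaluation per edge)."""
    base = metric(clean)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = corrupt[i]
        effects.append(metric(patched) - base)
    return effects


def prune_least_important(scores, k):
    """Return the indices of the edges kept after dropping the k smallest |score|."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return sorted(order[k:])


# Hypothetical clean and corrupted activations for four edges.
clean = [0.30, -0.10, 0.02, 0.25]
corrupt = [-0.20, -0.09, 0.01, 0.05]

scores = attribution_scores(clean, corrupt)
exact = activation_patching_effects(clean, corrupt)
kept = prune_least_important(scores, k=2)
```

In this toy setting a single gradient evaluation scores every edge, whereas the exact patching loop re-evaluates the metric once per edge. The linear estimate is close to the exact effect when the clean and corrupted activations differ only slightly, and drifts further from it as the difference grows.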

Abstract

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discover