BriefGPT.xyz
Oct, 2023
Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy
TL;DR
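By applying a simple method based on attribution patching to prune the least important edges of a neural network, we achieve a higher AUC in circuit recovery than existing methods.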
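As a rough illustration of the idea in the TL;DR, the sketch below scores each edge with a first-order (gradient times activation difference) estimate of how much a metric would change if that edge were patched, then drops the lowest-scoring edges. It is a toy under stated assumptions, not the authors' implementation: the metric, its analytic gradient, the activation values, and every function name here are hypothetical, and a real model would obtain the gradients from automatic differentiation.

```python
import math

def metric(acts, weights):
    """Toy task metric (assumed): weighted sum of tanh-squashed edge activations."""
    return sum(w * math.tanh(a) for a, w in zip(acts, weights))

def metric_grad(acts, weights):
    """Analytic gradient of the toy metric with respect to each edge activation."""
    return [w * (1.0 - math.tanh(a) ** 2) for a, w in zip(acts, weights)]

def attribution_scores(clean, corrupt, weights):
    """First-order estimate of the metric change from patching one edge
    from its clean activation to its corrupted activation."""
    grads = metric_grad(clean, weights)
    return [abs((c - a) * g) for a, c, g in zip(clean, corrupt, grads)]

def prune_lowest(scores, k):
    """Return the indices of the edges kept after dropping the k lowest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[k:])

# Hypothetical activations on four edges for a clean and a corrupted input.
clean = [0.5, -1.0, 2.0, 0.1]
corrupt = [0.4, 1.0, 2.1, 0.1]
weights = [1.0, 0.8, 1.5, 2.0]

scores = attribution_scores(clean, corrupt, weights)
kept = prune_lowest(scores, k=2)
print(kept)
```

All scores come from a single gradient evaluation on the clean input, which is what makes this kind of estimate cheap compared with patching each edge separately. Because it is only a linear approximation, the scores can differ noticeably from the exact patched change when activations shift a lot.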
Abstract
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discover