May, 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill, Thang Bui
TL;DR
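Introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. By training sparse autoencoders, the authors identify circuit-relevant attention heads directly from positive examples alone, achieving relatively high precision and recall while reducing runtime.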
Abstract
This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques ...