BriefGPT.xyz
Oct, 2024
Unveiling Language Skills under Circuits
Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
TL;DR
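Existing circuit analyses cannot fully characterize the functionality of language models (LMs). To address this gap, this study proposes a new concept, the "memory circuit", which makes it possible to manipulate a language model's memory-reading function independently. Experiments show that the skill paths we identify correspond to language skills, which supports the hypothesis that language skills can be identified through circuit dissection. The results also reveal how shallow and deep language skills are distributed, and indicate that complex skills are built on top of simpler ones.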
Abstract
The exploration of language skills in Language Models (LMs) has always been one of the central goals in Mechanistic Interpretability. However, existing circuit analyses often fall short in representing the full f…