Sep, 2022
Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan...
TL;DR
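This paper presents a toy model in which polysemanticity can be fully understood: it arises when a model stores additional sparse features in "superposition". The authors demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. They also discuss potential implications for mechanistic interpretability.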
Abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where