Recent work on sparse autoencoders (SAEs) has shown promise in extracting
interpretable features from neural networks and addressing challenges with
polysemantic neurons caused by superposition. In this paper, we apply SAEs to
the early vision layers of InceptionV1, a well-studied convolutional neural
network, with a focus on curve detectors. Our results demonstrate that SAEs can
uncover new interpretable features not apparent from examining individual
neurons, including additional curve detectors that fill in previous gaps. We
also find that SAEs can decompose some polysemantic neurons into more
monosemantic constituent features. These findings suggest SAEs are a valuable
tool for understanding InceptionV1, and convolutional neural networks more
generally.
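
As a rough illustration of the technique named above, a sparse autoencoder can be sketched as a one-hidden-layer network trained to reconstruct activations under an L1 sparsity penalty. The sketch below is a minimal NumPy version and not the paper's implementation: the function names (`init_sae`, `sae_forward`, `sae_loss`), the dictionary size, the initialisation scale, and the L1 coefficient are all assumptions. Treating each spatial position's channel vector as one input row is likewise an assumed convention for convolutional activations.

```python
import numpy as np


def init_sae(d_in, d_hidden, seed=0):
    """Create parameters for a one-hidden-layer sparse autoencoder (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w_enc = rng.normal(0.0, 0.1, size=(d_in, d_hidden))
    # Start the decoder as the encoder transpose, with each dictionary row
    # rescaled to unit length (a common but assumed choice).
    w_dec = w_enc.T.copy()
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": w_enc,
        "b_enc": np.zeros(d_hidden),
        "w_dec": w_dec,
        "b_dec": np.zeros(d_in),
    }


def sae_forward(params, x):
    """Encode activations x (rows = samples) into sparse features, then reconstruct."""
    pre = (x - params["b_dec"]) @ params["w_enc"] + params["b_enc"]
    features = np.maximum(0.0, pre)  # ReLU keeps feature activations non-negative
    x_hat = features @ params["w_dec"] + params["b_dec"]
    return features, x_hat


def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    features, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon + l1_coeff * sparsity
```

In this sketch the hidden layer is wider than the input (for example `init_sae(8, 32)`), so that each learned feature can, in principle, capture a narrower pattern than any single neuron does. Training by gradient descent on `sae_loss` is omitted here.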

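This work applies sparse autoencoders (SAEs) to the early vision layers of a convolutional neural network. It finds that SAEs can reveal new interpretable features that are hard to see from individual neurons, including additional curve detectors that fill previous gaps, and can decompose some polysemantic neurons into more monosemantic constituent features. These findings suggest that SAEs are a valuable tool for understanding InceptionV1 and convolutional neural networks.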

The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision

Polysemantic neurons (neurons that activate for a set of unrelated features)
have been seen as a significant obstacle to the interpretability of
task-optimized deep networks, with implications for AI safety. The classic
origin story of polysemanticity is that the data contains more "features" than
neurons, such that learning to perform a task forces the network to co-allocate
multiple unrelated features to the same neuron, endangering our ability to
understand the network's internal processing. In this work, we present a second
and non-mutually exclusive origin story of polysemanticity. We show that
polysemanticity can arise incidentally, even when there are ample neurons to
represent all features in the data, using a combination of theory and
experiments. This second type of polysemanticity occurs because random
initialization can, by chance alone, initially assign multiple features to the
same neuron, and the training dynamics then strengthen such overlap. Due to its
origin, we term this "incidental polysemanticity".
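
The chance overlap at initialization described above can be illustrated with a toy count. The sketch below is not the paper's model or experiment: it assumes Gaussian initial weights, assumes that a feature is "assigned" to the neuron with the largest absolute initial weight for that feature, and uses a hypothetical function name (`count_initial_collisions`). It does not simulate the training dynamics that would then strengthen the overlap.

```python
import numpy as np


def count_initial_collisions(n_features, n_neurons, seed=0):
    """Count neurons that are the strongest initial match for more than one feature."""
    rng = np.random.default_rng(seed)
    # Row i holds the random initial weights from feature i to every neuron.
    weights = rng.normal(size=(n_features, n_neurons))
    # Assumed assignment rule: each feature goes to its largest-magnitude weight.
    top_neuron = np.argmax(np.abs(weights), axis=1)
    # How many features landed on each neuron.
    per_neuron = np.bincount(top_neuron, minlength=n_neurons)
    return int(np.sum(per_neuron > 1))
```

Calling `count_initial_collisions(50, 200)` uses four times as many neurons as features, so there is ample room for every feature to have its own neuron. Any nonzero count in that setting comes from chance alone.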

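Polysemantic neurons are a significant obstacle in task-optimized deep networks, with implications for AI safety. This study proposes a second possible origin of polysemanticity, termed "incidental polysemanticity", and demonstrates its existence through theory and experiments.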

Incidental Polysemanticity

Mechanistic interpretability aims to understand how models store
representations by breaking down neural networks into interpretable units.
However, the occurrence of polysemantic neurons, or neurons that respond to
multiple unrelated features, makes interpreting individual neurons challenging.
This has led to the search for meaningful vectors, known as concept vectors, in
activation space instead of individual neurons. The main contribution of this
paper is a method to disentangle polysemantic neurons into concept vectors
encapsulating distinct features. Our method can search for fine-grained
concepts according to the user's desired level of concept separation. The
analysis shows that polysemantic neurons can be disentangled into directions
consisting of linear combinations of neurons. Our evaluations show that the
concept vectors found encode coherent, human-understandable features.

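To advance mechanistic interpretability, this study proposes a method that disentangles polysemantic neurons into concept vectors, each representing a single concept. The method can search for fine-grained concepts at the level of concept separation the user wants. The analysis shows that polysemantic neurons can be decomposed into directions that are linear combinations of neurons, and the evaluation shows that the concept vectors found encode coherent, human-understandable features.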