Advances in Large Language Models (LLMs) have led to remarkable capabilities,
yet their inner mechanisms remain largely unknown. To understand these models,
we need to unravel the functions of individual neurons and their contribution
to the network. This paper introduces a novel automated approach designed to
scale interpretability techniques across a vast array of neurons within LLMs,
to make them more interpretable and ultimately safe. Conventional methods
require examination of examples with strong neuron activation and manual
identification of patterns to decipher the concepts a neuron responds to. We
propose Neuron to Graph (N2G), an innovative tool that automatically extracts a
neuron's behaviour from the dataset it was trained on and translates it into an
interpretable graph. N2G uses truncation and saliency methods to emphasise only
the most pertinent tokens to a neuron while enriching dataset examples with
diverse samples to better encompass the full spectrum of neuron behaviour.
These graphs can be visualised to aid researchers' manual interpretation, and
can generate token activations on text for automatic validation by comparison
with the neuron's ground-truth activations, which we use to show that N2G
predicts neuron activation better than two baseline methods. We also
demonstrate how the generated graph representations can be flexibly used to
facilitate further automation of interpretability research, by searching for
neurons with particular properties, or programmatically comparing neurons to
each other to identify similar neurons. Our method easily scales to build graph
representations for all neurons in a 6-layer Transformer model using a single
Tesla T4 GPU, allowing for wide usability. We release the code and instructions
for use at this https URL
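
The truncation, saliency and graph-building steps described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_neuron`, `truncate`, `salient_mask`, `add_example`, `predict`, the 0.5 thresholds and the `<pad>` replacement token are all assumptions made for illustration, the "neuron" is a hand-written rule rather than a real model unit, and the enrichment of examples with diverse samples is omitted.

```python
from typing import Callable, Dict, List

# Hypothetical interface: returns the neuron's activation on the final token.
Activation = Callable[[List[str]], float]
WILDCARD = "*"  # stands for "any token" at positions the neuron ignores


def toy_neuron(tokens: List[str]) -> float:
    """Stand-in for a real neuron: fires on 'York' when preceded by 'New'."""
    if len(tokens) >= 2 and tokens[-1] == "York" and tokens[-2] == "New":
        return 1.0
    return 0.0


class Node:
    """Trie node; paths are read backwards from the activating token."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.activation = 0.0


def truncate(tokens: List[str], act: Activation, keep: float = 0.5) -> List[str]:
    """Shortest suffix that retains at least `keep` of the full activation."""
    full = act(tokens)
    for start in range(len(tokens) - 1, -1, -1):
        if act(tokens[start:]) >= keep * full:
            return tokens[start:]
    return tokens


def salient_mask(tokens: List[str], act: Activation,
                 threshold: float = 0.5, pad: str = "<pad>") -> List[bool]:
    """True where replacing the token with `pad` removes most of the activation."""
    full = act(tokens)
    mask = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [pad] + tokens[i + 1:]
        mask.append(full - act(perturbed) > threshold * full)
    return mask


def add_example(root: Node, tokens: List[str], mask: List[bool],
                activation: float) -> None:
    """Store one example, replacing non-salient tokens with a wildcard."""
    node = root
    for tok, important in zip(reversed(tokens), reversed(mask)):
        key = tok if important else WILDCARD
        node = node.children.setdefault(key, Node())
    node.activation = max(node.activation, activation)


def predict(root: Node, tokens: List[str]) -> float:
    """Highest stored activation among paths matching the end of `tokens`."""
    def walk(node: Node, i: int) -> float:
        best = node.activation
        if i < 0:
            return best
        for key in (tokens[i], WILDCARD):
            child = node.children.get(key)
            if child is not None:
                best = max(best, walk(child, i - 1))
        return best
    return walk(root, len(tokens) - 1)


root = Node()
example = "I flew to New York".split()
context = truncate(example, toy_neuron)   # drop context the neuron ignores
mask = salient_mask(context, toy_neuron)  # keep only tokens that matter
add_example(root, context, mask, toy_neuron(context))

print(context, mask)
print(predict(root, "She loves New York".split()))
print(predict(root, ["York"]))
```

Comparing `predict` with the neuron's real activations on held-out text is the kind of automatic validation the abstract refers to; here the toy graph fires on "New York" in a new sentence and stays silent on "York" alone.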

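This paper introduces an automated method for interpreting the behaviour of neurons in large language models and converting it into an interpretable graph representation, thereby improving the interpretability and safety of large language models.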

Neuron to Graph: Interpreting Language Model Neurons at Scale

We describe a procedure for explaining neurons in deep representations by
identifying compositional logical concepts that closely approximate neuron
behavior. Compared to prior work that uses atomic labels as explanations,
analyzing neurons compositionally allows us to more precisely and expressively
characterize their behavior. We use this procedure to answer several questions
on interpretability in models for vision and natural language processing.
First, we examine the kinds of abstractions learned by neurons. In image
classification, we find that many neurons learn highly abstract but
semantically coherent visual concepts, while other polysemantic neurons detect
multiple unrelated features; in natural language inference (NLI), neurons learn
shallow lexical heuristics from dataset biases. Second, we see whether
compositional explanations give us insight into model performance: vision
neurons that detect human-interpretable concepts are positively correlated with
task performance, while NLI neurons that fire for shallow heuristics are
negatively correlated with task performance. Finally, we show how compositional
explanations provide an accessible way for end users to produce simple
"copy-paste" adversarial examples that change model behavior in predictable
ways.
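
The idea of approximating a neuron with a compositional logical concept can be sketched as a small beam search that scores formulas by intersection-over-union (IoU) with the set of inputs on which the neuron fires. This is a minimal illustration under stated assumptions, not the authors' code: the function names `iou` and `explain`, the concept names, the toy data and the beam settings are all hypothetical, and concepts are represented as plain Python sets of input indices.

```python
from typing import Dict, List, Set, Tuple

Formula = Tuple[str, Set[int]]  # (readable formula, inputs where it is true)


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection-over-union of two sets of input indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def explain(neuron: Set[int], concepts: Dict[str, Set[int]],
            max_len: int = 2, beam_size: int = 5) -> Tuple[str, float]:
    """Beam search over AND / OR / AND NOT compositions of atomic concepts."""
    def score(item: Formula) -> float:
        return iou(neuron, item[1])

    # Start from the best atomic concepts.
    beam: List[Formula] = sorted(concepts.items(), key=score, reverse=True)[:beam_size]
    for _ in range(max_len - 1):
        candidates = list(beam)
        for f_name, f_set in beam:
            for c_name, c_set in concepts.items():
                candidates.append((f"({f_name} AND {c_name})", f_set & c_set))
                candidates.append((f"({f_name} OR {c_name})", f_set | c_set))
                candidates.append((f"({f_name} AND NOT {c_name})", f_set - c_set))
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
    best_name, best_set = beam[0]
    return best_name, iou(neuron, best_set)


# Toy data: indices of inputs on which each concept, and the neuron, is active.
concepts = {
    "water": {0, 1, 2, 3},
    "river": {2, 3, 4},
    "blue": {0, 1, 5, 6},
}
neuron = {0, 1}

best_atomic = max(iou(neuron, s) for s in concepts.values())
formula, best_score = explain(neuron, concepts)
print(best_atomic, formula, best_score)
```

On this toy data no single concept scores above 0.5, while a two-concept formula matches the neuron exactly, which is the sense in which compositional explanations can be more precise than atomic labels.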

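We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. It characterizes that behavior more precisely than prior explanation methods based on atomic labels, and we use it to answer several questions about the interpretability of vision and natural language processing models.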