Conversational Large Language Models are trained to refuse to answer harmful
questions. However, emergent jailbreaking techniques can still elicit unsafe
outputs, presenting an ongoing challenge for model alignment. To better
understand how different jailbreak types circumvent safeguards, this paper
analyses model activations on different jailbreak inputs. We find that it is
possible to extract a jailbreak vector from a single class of jailbreaks that
works to mitigate jailbreak effectiveness from other classes. This may indicate
that different kinds of effective jailbreaks operate via similar internal
mechanisms. We investigate a potential common mechanism of harmfulness feature
suppression, and provide evidence for its existence by looking at the
harmfulness vector component. These findings offer actionable insights for
developing more robust jailbreak countermeasures and lay the groundwork for a
deeper, mechanistic understanding of jailbreak dynamics in language models.
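
The abstract does not say how the jailbreak vector is extracted, so the following is only a minimal sketch under one common assumption: the vector is the difference between mean activations on jailbreak prompts and on plain prompts, and mitigation subtracts it from an activation. The function names, the 4-dimensional toy activations, and the `strength` parameter are all hypothetical. The two jailbreak classes share a direction by construction, so the sketch shows the mechanics only and is not evidence for the paper's finding.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def jailbreak_vector(jailbreak_acts, plain_acts):
    """Mean activation on jailbreak prompts minus mean activation on plain prompts."""
    mu_j = mean_vector(jailbreak_acts)
    mu_p = mean_vector(plain_acts)
    return [a - b for a, b in zip(mu_j, mu_p)]

def steer_away(activation, vector, strength=1.0):
    """Subtract a scaled copy of `vector` from a single activation."""
    return [a - strength * v for a, v in zip(activation, vector)]

def sample_activations(rng, shift, n=200, dim=4):
    """Toy stand-in for real activations: Gaussian noise with dimension 0 shifted."""
    return [[rng.gauss(shift if i == 0 else 0.0, 0.1) for i in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
plain = sample_activations(rng, shift=0.0)
class_a = sample_activations(rng, shift=2.0)  # class used to extract the vector
class_b = sample_activations(rng, shift=2.0)  # held-out class (same shift by construction)

vec_a = jailbreak_vector(class_a, plain)
steered_b = [steer_away(x, vec_a) for x in class_b]

# Projection of class B's mean activation onto the class-A vector,
# before and after steering.
before = dot(mean_vector(class_b), vec_a)
after = dot(mean_vector(steered_b), vec_a)
print(f"projection before: {before:.3f}, after: {after:.3f}")
```

With real models, the activations would come from a chosen layer of the network rather than from `sample_activations`, and whether a vector from one class transfers to another is the empirical question the paper studies.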

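Research on conversational large language models shows that jailbreaking techniques can bypass a model's safeguards. By analysing model activations on different types of jailbreak inputs, the paper finds that a jailbreak vector extracted from one class of jailbreaks can reduce the effectiveness of jailbreaks from other classes, which may indicate that different kinds of effective jailbreaks operate via similar internal mechanisms. It investigates harmfulness feature suppression as a possible common mechanism, provides empirical evidence that supports the development of more robust jailbreak countermeasures, and lays the groundwork for a deeper understanding of jailbreak dynamics in language models.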

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

Disentangling model activations into meaningful features is a central problem
in interpretability. However, the lack of ground-truth for these features in
realistic scenarios makes the validation of recent approaches, such as sparse
dictionary learning, elusive. To overcome this, we propose a framework to
evaluate feature dictionaries in the context of specific tasks, by comparing
them against \emph{supervised} feature dictionaries. First, we demonstrate that
supervised dictionaries achieve excellent approximation, control and
interpretability of model computations on the task. Second, we use the
supervised dictionaries to develop and contextualize evaluations of
unsupervised dictionaries along the same three axes.
We apply this framework to the indirect object identification task (IOI)
using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or
OpenWebText datasets. We find that these SAEs capture interpretable features
for the IOI task, but they are not as successful as supervised features in
controlling the model. Finally, we observe two qualitative phenomena in SAE
training: feature occlusion (where a causally relevant concept is robustly
overshadowed by even slightly higher-magnitude ones in the learned features),
and feature over-splitting (where binary features split into many smaller
features without clear interpretation). We hope that our framework will be a
useful step towards more objective and grounded evaluations of sparse
dictionary learning methods.
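
The abstract does not define how supervised feature dictionaries are built. As a rough sketch under an assumed construction, one can take one feature vector per (attribute, value) pair, equal to the conditional mean activation minus the global mean, and approximate an activation as the global mean plus the features for its labels. The attribute names `io_name` and `io_position`, the toy directions, and the noise level are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fit_supervised_dictionary(acts, labels):
    """One feature per (attribute, value): conditional mean minus global mean."""
    mu = mean_vector(acts)
    groups = defaultdict(list)
    for x, lab in zip(acts, labels):
        for attr, val in lab.items():
            groups[(attr, val)].append(x)
    feats = {key: [m - g for m, g in zip(mean_vector(rows), mu)]
             for key, rows in groups.items()}
    return mu, feats

def reconstruct(mu, feats, lab):
    """Global mean plus the feature vector of each labelled attribute value."""
    out = list(mu)
    for attr, val in lab.items():
        out = [o + f for o, f in zip(out, feats[(attr, val)])]
    return out

# Toy activations: a name direction plus a position direction plus small noise.
rng = random.Random(0)
NAME_DIRS = {"Mary": [1.0, 0.0, 0.0, 0.0],
             "John": [0.0, 1.0, 0.0, 0.0],
             "Anna": [-1.0, -1.0, 0.0, 0.0]}
POS_DIRS = {"first": [0.0, 0.0, 1.0, 0.0],
            "second": [0.0, 0.0, -1.0, 0.0]}

acts, labels = [], []
for _ in range(50):  # balanced design: every (name, position) pair 50 times
    for name, name_dir in NAME_DIRS.items():
        for pos, pos_dir in POS_DIRS.items():
            acts.append([a + b + rng.gauss(0.0, 0.05)
                         for a, b in zip(name_dir, pos_dir)])
            labels.append({"io_name": name, "io_position": pos})

mu, feats = fit_supervised_dictionary(acts, labels)

# Approximation quality: fraction of variance the dictionary fails to explain.
residual = sum(sq_dist(x, reconstruct(mu, feats, lab))
               for x, lab in zip(acts, labels))
total = sum(sq_dist(x, mu) for x in acts)
unexplained = residual / total
print(f"{len(feats)} features, unexplained variance fraction: {unexplained:.4f}")
```

The same unexplained-variance ratio could be computed for an unsupervised dictionary's reconstructions, which gives one concrete reading of the "approximation" axis mentioned above.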

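We propose a framework for evaluating feature dictionaries that addresses the lack of ground truth for features in realistic interpretability settings, and apply it to the indirect object identification (IOI) task using GPT-2 Small. We find that although sparse autoencoders capture interpretable features, they are less successful than supervised features at controlling the model, and we observe two qualitative phenomena in SAE training: feature occlusion and feature over-splitting. We hope our framework will be a useful step towards objective evaluation of sparse dictionary learning methods.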

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Given the success of Large Language Models (LLMs), there has been
considerable interest in studying the properties of model activations. The
literature overwhelmingly agrees that LLM representations are dominated by a
few ``outlier dimensions'' with exceedingly high variance and magnitude.
Several studies in Natural Language Processing (NLP) have sought to mitigate
the impact of such outlier dimensions and force LLMs to be isotropic (i.e.,
have uniform variance across all dimensions in embedding space). Isotropy is
thought to be a desirable property for LLMs that improves model performance and
more closely aligns textual representations with human intuition. However, many
of the claims regarding isotropy in NLP have been based on the average cosine
similarity of embeddings, which has recently been shown to be a flawed measure
of isotropy. In this paper, we propose I-STAR: IsoScore$^{\star}$-based STable
Anisotropic Regularization, a novel regularization method that can be used to
increase or decrease levels of isotropy in embedding space during training.
I-STAR uses IsoScore$^{\star}$, the first accurate measure of isotropy that is
both differentiable and stable on mini-batch computations. In contrast to
several previous works, we find that \textit{decreasing} isotropy in
contextualized embeddings improves performance on the majority of tasks and
models considered in this paper.
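
One way to see why average cosine similarity can mislead is a toy comparison on synthetic point clouds. The sketch below is an illustration only: `top_variance_share` is a simple per-coordinate stand-in chosen for this example and is not IsoScore$^{\star}$, and the two Gaussian clouds are hypothetical data rather than LLM embeddings.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_cosine_similarity(points, rng, n_pairs=5000):
    """Mean cosine similarity over randomly sampled pairs of distinct points."""
    total = 0.0
    for _ in range(n_pairs):
        u, v = rng.sample(points, 2)
        total += cosine(u, v)
    return total / n_pairs

def top_variance_share(points):
    """Share of total per-coordinate variance held by the largest coordinate."""
    n = len(points)
    variances = []
    for col in zip(*points):
        m = sum(col) / n
        variances.append(sum((c - m) ** 2 for c in col) / n)
    return max(variances) / sum(variances)

def gaussian_cloud(rng, stds, mean, n=1000):
    """Independent Gaussian coordinates with the given means and std devs."""
    return [[rng.gauss(m, s) for m, s in zip(mean, stds)] for _ in range(n)]

rng = random.Random(0)
dim = 8

# Cloud 1: zero mean, one "outlier dimension" with much larger variance.
outlier = gaussian_cloud(rng, stds=[10.0] + [1.0] * (dim - 1), mean=[0.0] * dim)
# Cloud 2: equal variance in every dimension, but shifted away from the origin.
shifted = gaussian_cloud(rng, stds=[1.0] * dim, mean=[5.0] * dim)

outlier_cos = average_cosine_similarity(outlier, rng)
outlier_share = top_variance_share(outlier)
shifted_cos = average_cosine_similarity(shifted, rng)
shifted_share = top_variance_share(shifted)

print(f"outlier cloud: avg cosine {outlier_cos:.3f}, top variance share {outlier_share:.3f}")
print(f"shifted cloud: avg cosine {shifted_cos:.3f}, top variance share {shifted_share:.3f}")
```

In this toy setting the cloud with an outlier dimension has near-zero average cosine similarity, while the cloud with uniform variance has high average cosine similarity simply because its mean is far from the origin. Average cosine similarity therefore tracks the mean offset rather than how variance is spread across dimensions.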

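This paper proposes I-STAR, a new regularization method that can increase or decrease the level of isotropy in embedding space during training, and finds that decreasing isotropy improves performance on most of the tasks and models considered.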