Language model alignment has become an important component of AI safety:
it allows safe interactions between humans and language models by enhancing
desired behaviors and inhibiting undesired ones. Alignment is often achieved by tuning the
model or inserting preset aligning prompts. Recently, representation
engineering, a method that alters the model's behavior by changing its
representations post-training, was shown to be effective in aligning LLMs (Zou
et al., 2023a). Representation engineering yields gains in alignment-oriented
tasks such as resistance to adversarial attacks and reduction of social biases,
but was also shown to reduce the model's ability to perform
basic tasks. In this paper, we study the tradeoff between the increase in
alignment and decrease in helpfulness of the model. We propose a theoretical
framework that provides bounds for these two quantities, and demonstrate their
relevance empirically. We find that although helpfulness
generally decreases, it does so quadratically with the norm of the
representation engineering vector, whereas alignment increases linearly with
it, indicating a regime of small norms in which it is efficient to use representation
engineering. We validate our findings empirically and chart the boundaries of
the usefulness of representation engineering for alignment.
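
To make the linear-versus-quadratic claim concrete, the following is a minimal toy sketch and not the framework or bounds of the paper. It assumes, purely for illustration, that the alignment gain scales as `a * r` and the helpfulness loss as `b * r**2`, where `r` is the norm of the representation engineering vector. The coefficients `a` and `b` and the function names `steer`, `toy_tradeoff`, and `break_even_norm` are hypothetical.

```python
import numpy as np


def steer(hidden, direction, r):
    """Add a steering vector of norm r, pointing along `direction`, to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + r * unit


def toy_tradeoff(r, a=1.0, b=0.5):
    """Toy model (assumed, not from the paper): linear gain, quadratic loss."""
    alignment_gain = a * r          # grows linearly with the norm r
    helpfulness_loss = b * r ** 2   # grows quadratically with the norm r
    return alignment_gain, helpfulness_loss


def break_even_norm(a=1.0, b=0.5):
    """Norm at which the toy loss equals the toy gain: a * r == b * r**2."""
    return a / b


if __name__ == "__main__":
    for r in (0.25, 0.5, 1.0, 2.0, 4.0):
        gain, loss = toy_tradeoff(r)
        print(f"r={r:<5} gain={gain:<6} loss={loss:<6} gain-loss={gain - loss}")
```

Under these assumed scalings, the gain exceeds the loss for any norm below `a / b`, which is one way to read the "efficient regime" described above; beyond that norm the quadratic term dominates.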

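Language model alignment is an important component of AI safety: by enhancing desired behaviors and inhibiting undesired ones, it enables safe interaction between humans and language models. In this paper, we study the tradeoff between the increase in alignment and the decrease in the model's helpfulness, and propose a theoretical framework whose relevance we demonstrate empirically. We find that as the norm of the representation engineering vector increases, the model's alignment increases linearly while its helpfulness decreases quadratically, which indicates that using representation engineering is effective. We validate our findings experimentally and outline the boundaries of the usefulness of representation engineering for alignment.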