This research explores strategies for steering the output of large language
models (LLMs) towards specific styles, such as sentiment, emotion, or writing
style, by adding style vectors to the activations of hidden layers during text
generation. We show that style vectors can be computed simply from the layer
activations recorded for input texts in a specific style, in contrast to more
complex training-based approaches. Through a series of experiments, we
demonstrate the effectiveness of activation engineering using such style
vectors to influence the style of generated text in a nuanced and
parameterisable way, distinguishing it from prompt engineering. The presented
research constitutes a significant step towards developing more adaptive and
effective AI-empowered interactive systems.
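
As an illustration of the idea described above, the following minimal sketch computes a style vector from recorded activations and adds it to a hidden-layer activation. It assumes one plausible reading of "computed from recorded layer activations": the mean activation over texts in the target style minus the mean over the remaining texts. The function names, the `strength` parameter, and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import fmean

def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def style_vector(target_acts, other_acts):
    """Assumed definition: mean target-style activation minus mean of the rest."""
    mu_target = mean_vector(target_acts)
    mu_other = mean_vector(other_acts)
    return [t - o for t, o in zip(mu_target, mu_other)]

def steer(hidden, vec, strength):
    """Add the scaled style vector to one hidden-layer activation."""
    return [h + strength * v for h, v in zip(hidden, vec)]

# Toy recorded activations for a 3-dimensional layer (hypothetical numbers).
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v_pos = style_vector(positive_acts, negative_acts)
steered = steer([0.5, 0.5, 0.5], v_pos, strength=0.5)
```

The scalar `strength` stands in for the parameterisable aspect mentioned in the abstract: setting it to zero leaves the activation unchanged, and larger values push it further along the style direction. In a real model the activations would be recorded from, and added back into, a chosen hidden layer during generation.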

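This research explores strategies for steering the output of large language models (LLMs) towards specific styles (such as sentiment, emotion, or writing style) by adding style vectors to the activations of hidden layers during text generation. Through a series of experiments, we demonstrate that activation engineering with such style vectors influences the style of generated text effectively and in a tunable way, which distinguishes it from prompt engineering and advances the development of more adaptive and effective AI-empowered interactive systems.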

Style Vectors for Steering Generative Large Language Models

Reliably controlling the behavior of large language models (LLMs) is a
pressing open problem. Existing methods include supervised finetuning,
reinforcement learning from human feedback (RLHF), prompt engineering and
guided decoding. We instead investigate activation engineering: modifying
activations at inference time to predictably alter model behavior. In
particular, we bias the forward pass with an added 'steering vector' implicitly
specified through natural language.
Unlike past work which learned these steering vectors (Subramani, Suresh, and
Peters 2022; Hernandez, Li, and Andreas 2023), our Activation Addition (ActAdd)
method computes them by taking the activation differences that result from
pairs of prompts. We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet.
Our inference-time approach yields control over high-level properties of output
and preserves off-target model performance. It requires far less compute and
implementation effort than finetuning or RLHF, lets users provide natural
language specifications, and its overhead scales naturally with model size.
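
The prompt-pair construction described above can be sketched as follows. This is a toy illustration, not the authors' implementation: `embed` and `activations` are hypothetical stand-ins for reading one layer's activations from a real model, and the choices to require equal-length prompts and to add the steering vector at the leading positions are assumptions made for the example.

```python
def embed(token, dim=4):
    """Hypothetical deterministic toy embedding standing in for a model layer."""
    return [float((ord(token[0]) + i * len(token)) % 5) for i in range(dim)]

def activations(prompt):
    """Per-position toy activations for a whitespace-tokenised prompt."""
    return [embed(tok) for tok in prompt.split()]

def steering_vector(prompt_plus, prompt_minus, coeff):
    """coeff * (activations(prompt_plus) - activations(prompt_minus))."""
    h_plus, h_minus = activations(prompt_plus), activations(prompt_minus)
    if len(h_plus) != len(h_minus):
        # Assumption: the two prompts must align position by position.
        raise ValueError("prompts must tokenise to the same length")
    return [[coeff * (p - m) for p, m in zip(row_p, row_m)]
            for row_p, row_m in zip(h_plus, h_minus)]

def act_add(user_acts, steer):
    """Add steering rows to the leading positions; later positions are untouched."""
    out = [row[:] for row in user_acts]
    for pos, row in enumerate(steer[:len(out)]):
        out[pos] = [u + s for u, s in zip(out[pos], row)]
    return out

# The steering direction is specified only by a pair of natural language prompts.
steer = steering_vector("Love", "Hate", coeff=2.0)
user = activations("I think dogs are")
steered = act_add(user, steer)
```

No parameters are trained here: the vector comes from two forward passes and a subtraction, which is what distinguishes this approach from learned steering vectors. The coefficient `coeff` (a hypothetical name) scales the strength of the intervention.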

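Controlling the behavior of large language models is a pressing open problem. In this paper, we propose a method called Activation Addition (ActAdd), which predictably alters model behavior by modifying activations at inference time. We demonstrate it on GPT-2 and compare the compute and implementation effort it requires with those of finetuning and reinforcement learning from human feedback (RLHF).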