Despite the remarkable achievements of language models (LMs) across a broad
spectrum of tasks, their propensity for generating toxic outputs remains a
prevalent concern. Current solutions involving fine-tuning or auxiliary models
usually require extensive memory and computational resources, rendering them
less practical for deployment in large language models (LLMs). In this paper,
we propose DeStein, a novel method that detoxifies LMs by altering their
internal representations in the activation space with lower resource and time
cost. Specifically, we leverage self-induced steering pairs to identify
detoxification vectors through arithmetic operations in the activation space.
During inference, detoxification is achieved by blending the detoxification
vectors with the original representations. Empirical results demonstrate that
our method significantly outperforms previous state-of-the-art approaches on
popular detoxification metrics, while also maintaining satisfactory generation
quality and diversity. Furthermore, we extend our method to multiple LLMs,
demonstrating its practicality and scalability. Warning: some example model
outputs contain highly offensive or disturbing text.
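
The two steps described above (deriving a detoxification vector by arithmetic on paired activations, then blending it into the representations at inference) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `detox_vector` and `blend`, the array shapes, the mean-difference formula, the scalar `alpha`, and the per-head `head_weights` are all assumptions chosen for illustration, and the abstract does not specify how the steering pairs are generated or how the head-wise weights are obtained.

```python
import numpy as np


def detox_vector(toxic_acts, nontoxic_acts):
    """Average the activation difference over steering pairs.

    Both inputs are assumed to have shape (n_pairs, n_heads, head_dim),
    where row i of each array comes from the same steering pair.
    Returns an array of shape (n_heads, head_dim).
    """
    toxic_acts = np.asarray(toxic_acts, dtype=float)
    nontoxic_acts = np.asarray(nontoxic_acts, dtype=float)
    # Direction pointing from toxic toward non-toxic activations.
    return (nontoxic_acts - toxic_acts).mean(axis=0)


def blend(hidden, vector, alpha=0.5, head_weights=None):
    """Add the scaled detoxification vector to one token's head activations.

    hidden and vector are assumed to have shape (n_heads, head_dim).
    head_weights (hypothetical) scales the shift per attention head;
    when omitted, every head receives the same weight of 1.
    """
    hidden = np.asarray(hidden, dtype=float)
    vector = np.asarray(vector, dtype=float)
    if head_weights is None:
        head_weights = np.ones(vector.shape[0])
    head_weights = np.asarray(head_weights, dtype=float)
    # Broadcast the per-head weight across the head dimension.
    return hidden + alpha * head_weights[:, None] * vector


# Toy usage: 4 steering pairs, 2 heads, head dimension 3.
toxic = np.zeros((4, 2, 3))
nontoxic = np.ones((4, 2, 3))
v = detox_vector(toxic, nontoxic)
steered = blend(np.zeros((2, 3)), v, alpha=0.5, head_weights=[1.0, 0.0])
```

In this toy setup only the first head is shifted, because the second head's weight is zero. In a real model the activations would be read and modified inside the forward pass rather than passed in as plain arrays.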

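This study proposes DeStein, a new method that detoxifies language models by adjusting their internal representations in the activation space. At low resource and time cost, it blends detoxification vectors with the original representations. Empirical results show that the method clearly outperforms existing state-of-the-art approaches on common detoxification metrics while maintaining satisfactory generation quality and diversity. The method is also extended to multiple large language models, demonstrating its practicality and scalability.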

DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion

Recent advances in conversational Large Language Models (LLMs) have revealed a
concerning trend: many new base LLMs lose some of their foundational
capabilities after Supervised Fine-Tuning (SFT), through forgetting or a
general decline in the base model's abilities. Moreover, fine-tuned models struggle to
align with user preferences, inadvertently increasing the generation of toxic
outputs when specifically prompted. To overcome these challenges, we adopted an
innovative approach by completely bypassing SFT and directly implementing
Harmless Reinforcement Learning from Human Feedback (RLHF). Our method not only
preserves the base model's general capabilities but also significantly enhances
its conversational abilities, while notably reducing the generation of toxic
outputs. Our approach holds significant implications for fields that demand a
nuanced understanding and generation of responses, such as customer service. We
applied this methodology to Mistral, the most popular base model, thereby
creating Mistral-Plus. Our validation across 11 general tasks demonstrates that
Mistral-Plus outperforms similarly sized open-source base models and their
corresponding instruct versions. Importantly, the conversational abilities of
Mistral-Plus were significantly improved, indicating a substantial advancement
over traditional SFT models in both safety and user preference alignment.

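By adopting harmless reinforcement learning from human feedback, we bypass supervised fine-tuning and apply the method directly to Mistral, creating Mistral-Plus, which not only preserves the base model's general capabilities but also significantly enhances its conversational abilities and substantially reduces the generation of toxic outputs.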