BriefGPT.xyz
May, 2024
ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
TL;DR
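We propose ConTrans, a new framework based on concept transplantation. It refines concept vectors for value alignment taken from a source LLM, reformulates them with an affine transformation, and transplants them into the residual stream of a target LLM, enabling weak-to-strong alignment generalization and control.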
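The three steps named in the TL;DR (refine a concept vector, map it with an affine transformation, add it to the target's residual stream) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mean-difference estimate of the concept vector, the least-squares fit of the affine map, the `strength` parameter, and all function names are hypothetical choices made for the example, and activations are stood in for by plain NumPy arrays.

```python
import numpy as np


def concept_vector(pos_acts, neg_acts):
    """Assumed refinement step: unit-norm difference of mean activations.

    pos_acts / neg_acts: (n, d_src) source-model hidden states collected on
    examples that do / do not express the concept.
    """
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)


def fit_affine(src_acts, tgt_acts):
    """Assumed fitting step: least-squares solution of tgt ~= src @ W + b.

    src_acts: (n, d_src), tgt_acts: (n, d_tgt), paired on the same inputs.
    """
    ones = np.ones((src_acts.shape[0], 1))
    design = np.hstack([src_acts, ones])          # append a bias column
    sol, *_ = np.linalg.lstsq(design, tgt_acts, rcond=None)
    return sol[:-1], sol[-1]                      # W: (d_src, d_tgt), b: (d_tgt,)


def transplant(resid, v_src, W, strength=1.0):
    """Add the mapped concept direction to target residual-stream states.

    Only the linear part W is applied: a direction is a difference of two
    points, so the bias of the affine map cancels.
    """
    v_tgt = v_src @ W
    v_tgt = v_tgt / np.linalg.norm(v_tgt)
    return resid + strength * v_tgt               # broadcasts over positions
```

In this sketch `strength` scales how far the target's hidden states are pushed along the transplanted direction; a negative value would push away from the concept instead.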
Abstract
Ensuring that large language models (LLMs) behave consistently with human goals, values, and intentions is crucial for their safety, yet computationally expensive. To reduce the computational cost of alignment training …