BriefGPT.xyz
Dec, 2023
Aligning Large Language Models with Human Preferences through Representation Engineering
Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv...
TL;DR
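Inspired by representation engineering, this work uses human feedback to identify the representations in large language models (LLMs) that correspond to high-level human preferences, then transforms those representations to achieve precise control over model behavior. The RAHF method proves effective at capturing and manipulating representations, can align with a wide range of human preferences, and shows potential for advancing LLM performance.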
Abstract
Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving
→