BriefGPT.xyz
Jan, 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld...
TL;DR
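This paper examines alignment algorithms, pre-trained language models, direct preference optimization (DPO), toxicity reduction, and model alignment, and proposes a simple method to reverse a model's alignment so that it reverts to its toxic behavior.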
Abstract
While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become "aligned", thus makin