A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Jan, 2024

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld...

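TL;DR: This paper studies alignment algorithms, pre-trained language models, direct preference optimization (DPO), toxicity reduction, and model alignment, and presents a simple method for reversing a model's alignment so that it reverts to its toxic behavior.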

Abstract

While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become "aligned", thus makin