BriefGPT.xyz
Jun, 2024
Measuring memorization in RLHF for code completion
Aneesh Pappu, Billy Porter, Ilia Shumailov, Jamie Hayes
TL;DR
By analyzing how training data memorization surfaces and propagates during reinforcement learning, the study finds that aligning a model with reinforcement learning from human feedback causes less memorization of training data than directly fine-tuning on the alignment data; however, examples already memorized during the fine-tuning stage largely remain memorized through RLHF, which may pose privacy risks.
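The measurement at the heart of this finding is a memorization check: prompt the model with the prefix of a training example and test whether its completion reproduces the training target. As a rough illustration only (not the paper's actual metric), a minimal sketch using an edit-similarity heuristic might look like this; `is_memorized`, `memorization_rate`, and the 0.9 threshold are all hypothetical choices for the example.

```python
import difflib

def is_memorized(model_completion: str, training_target: str,
                 threshold: float = 0.9) -> bool:
    """Flag a completion as memorized when it is near-identical to the
    training target, using edit similarity as a stand-in metric."""
    sim = difflib.SequenceMatcher(None, model_completion, training_target).ratio()
    return sim >= threshold

def memorization_rate(pairs) -> float:
    """Fraction of (completion, target) pairs flagged as memorized."""
    flags = [is_memorized(completion, target) for completion, target in pairs]
    return sum(flags) / len(flags)
```

Comparing this rate over the same probe set for a fine-tuned model versus an RLHF-aligned model would expose the kind of gap the study reports, with examples memorized at fine-tuning time tending to stay flagged after RLHF.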
Abstract
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training […]