Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun...
TL;DR: language model alignment methods, reinforcement learning, overfitting, offline alignment algorithms, sample efficiency.
Abstract
Language model alignment methods, such as reinforcement learning from human feedback (RLHF), have led to impressive advances in language model capabilities, but existing techniques are limited by a widely observed