This work studies the challenge of aligning large language models (LLMs) with
offline preference data. We focus on alignment by Reinforcement Learning from
Human Feedback (RLHF) in particular. While popular preference optimization
methods perform well empirically, they are not theoretically guaranteed to
converge to the optimal policy and, by classical offline reinforcement
learning (RL) results, can provably fail when data coverage is sparse. On the
other hand, a recent line of work has focused on
theoretically motivated preference optimization methods with provable
guarantees, but these are not computationally efficient for large-scale
applications like LLM alignment. To bridge this gap, we propose SPAC, a new
offline preference optimization method with self-play, inspired by the
on-average pessimism technique from the offline RL literature. SPAC is the
first provable and scalable approach to LLM alignment. We provide a
theoretical analysis of its convergence under single-policy concentrability in
the general function approximation setting and demonstrate its competitive
empirical performance for LLM alignment on a 7B Mistral model with Open LLM
Leaderboard evaluations.

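This study examines the challenge of aligning large language models with
offline preference data, focusing in particular on alignment via Reinforcement
Learning from Human Feedback. We propose SPAC, a new offline preference
optimization method based on self-play and inspired by the on-average
pessimism technique from the offline reinforcement learning literature, as the
first provable LLM alignment method that scales to large applications. We
provide a theoretical analysis of its convergence and demonstrate its
competitive empirical performance on a 7B Mistral model with Open LLM
Leaderboard evaluations.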