We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when a policy is trained on the learned reward model, the MLE fails, while a pessimistic MLE yields policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge, and that the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.
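As a rough illustration of the first claim, the sketch below simulates pairwise comparisons from a BTL model with a linear reward $r_\theta(x) = \theta^\top x$ and fits $\theta$ by maximum likelihood. It is a minimal sketch under stated assumptions, not the estimator analyzed in the paper: the Gaussian features, the plain gradient ascent, the step size, and the function names `simulate_btl` and `btl_mle` are all hypothetical choices made for illustration, and no norm constraint or pessimism is applied.

```python
import numpy as np


def simulate_btl(theta_star, n, rng):
    """Draw n pairwise comparisons under a BTL model with linear reward.

    Assumption (illustrative): item features are standard Gaussian.
    Returns feature differences (x1 - x0) and labels y, where y = 1
    means item 1 was preferred over item 0.
    """
    d = theta_star.shape[0]
    x0 = rng.normal(size=(n, d))
    x1 = rng.normal(size=(n, d))
    diff = x1 - x0
    # BTL: P(item 1 preferred) = sigmoid(r(x1) - r(x0)) with r(x) = theta^T x.
    p1 = 1.0 / (1.0 + np.exp(-diff @ theta_star))
    y = (rng.uniform(size=n) < p1).astype(float)
    return diff, y


def btl_mle(diff, y, steps=500, lr=0.5):
    """Unconstrained MLE via gradient ascent on the mean log-likelihood.

    Under BTL with a linear reward this is logistic regression on the
    feature differences, with no intercept term.
    """
    theta = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diff @ theta))
        grad = diff.T @ (y - p) / len(y)  # gradient of the mean log-likelihood
        theta += lr * grad
    return theta


rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5, 0.25])  # hypothetical ground-truth parameter
for n in (100, 10_000):
    diff, y = simulate_btl(theta_star, n, rng)
    theta_hat = btl_mle(diff, y)
    print(n, np.linalg.norm(theta_hat - theta_star))
```

Running the loop with increasing `n` gives an informal view of how the estimation error behaves as more comparisons are collected. It says nothing about the quality of a policy trained on the estimated reward, which is where the abstract contrasts the plain MLE with the pessimistic MLE.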

Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons