A crucial task in decision-making problems is reward engineering. In practice, there is often no obvious choice of reward function. A popular approach is therefore to introduce human feedback during training and use it to learn a reward function. Among policy learning methods that use human feedback, preference-based methods have demonstrated substantial success in recent empirical applications such as InstructGPT. In this work, we develop a theory that provably shows the benefits of preference-based methods in offline contextual bandits. In particular, we improve the modeling and suboptimality analysis for policy learning methods that run directly on human-scored samples. We then compare this analysis with the suboptimality guarantees of preference-based methods and show that preference-based methods enjoy lower suboptimality.

Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems