Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning
Large Language Models (LLMs) with human preferences. While these aligned
generative models have demonstrated impressive capabilities across various
tasks, the dependence on high-quality human preference data poses a costly
bottleneck in the practical implementation of RLHF. Hence, better and more adaptive
strategies for data collection are needed. To this end, we frame RLHF as a
contextual preference bandit problem with prompts as contexts and show that the
naive way of collecting preference data by choosing prompts uniformly at random
leads to a policy that suffers an $\Omega(1)$ suboptimality gap in rewards.
Then we propose $\textit{Active Preference Optimization}$ ($\texttt{APO}$), an
algorithm that actively selects prompts to collect preference data. Under the
Bradley-Terry-Luce (BTL) preference model, $\texttt{APO}$ achieves sample
efficiency without compromising on policy performance. We show that given a
sample budget of $T$, the suboptimality gap of a policy learned via
$\texttt{APO}$ scales as $O(1/\sqrt{T})$. Next, we propose a compute-efficient
batch version of $\texttt{APO}$ with minor modifications and evaluate its
performance in practice. Experimental evaluations on a human preference dataset
validate $\texttt{APO}$'s efficacy as a sample-efficient and practical solution
to data collection for RLHF, facilitating alignment of LLMs with human
preferences in a cost-effective and scalable manner.
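
To make the setting concrete, the following is a minimal sketch of the BTL preference model and of one plausible uncertainty-driven prompt selection rule. It is not the algorithm as specified in the paper: the linear reward $r(x, a) = \theta^\top \phi(x, a)$, the feature arrays, the design matrix `V`, and the helper names `btl_prob`, `select_prompt`, and `update_design` are all assumptions introduced here for illustration.

```python
import numpy as np


def btl_prob(theta, phi_a, phi_b):
    """BTL probability that response a is preferred over response b.

    Assumes (hypothetically) a linear reward r = theta . phi, so the
    preference probability is a sigmoid of the reward difference.
    """
    return 1.0 / (1.0 + np.exp(-(theta @ (phi_a - phi_b))))


def select_prompt(features, V):
    """Pick the prompt whose response pair is currently most uncertain.

    features: array of shape (n_prompts, 2, d) holding hypothetical
              features of two candidate responses per prompt.
    V:        (d, d) regularized design matrix of past feature differences.
    Uncertainty is measured as z^T V^{-1} z for z = phi_a - phi_b.
    """
    diffs = features[:, 0, :] - features[:, 1, :]
    V_inv = np.linalg.inv(V)
    scores = np.einsum("nd,de,ne->n", diffs, V_inv, diffs)
    return int(np.argmax(scores))


def update_design(V, features, idx):
    """Add the queried feature difference to the design matrix."""
    z = features[idx, 0] - features[idx, 1]
    return V + np.outer(z, z)
```

In contrast to sampling prompts uniformly at random, a rule of this kind directs each query toward the prompt where the current data say the least about which response is preferred; labels gathered this way would then be used to fit $\theta$, for example by maximum likelihood under `btl_prob`.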
