Preference-based reinforcement learning (RL) has shown potential for teaching
agents to perform target tasks without a costly, pre-defined reward
function by learning the reward from a supervisor's preferences between two
agent behaviors. However, preference-based learning often requires a large
amount of human feedback, making it difficult to apply this approach to various
applications. In supervised learning, this data-efficiency problem has
typically been addressed by using unlabeled samples or data augmentation
techniques. Motivated by the recent success of these
approaches, we present SURF, a semi-supervised reward learning framework that
utilizes a large number of unlabeled samples with data augmentation. In order
to leverage unlabeled samples for reward learning, we infer pseudo-labels of
the unlabeled samples based on the confidence of the preference predictor. To
further improve the label-efficiency of reward learning, we introduce a new
data augmentation that temporally crops consecutive subsequences from the
original behaviors. Our experiments demonstrate that our approach significantly
improves the feedback-efficiency of the state-of-the-art preference-based
method on a variety of locomotion and robotic manipulation tasks.
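
The two ingredients described above, confidence-based pseudo-labeling and
temporal cropping, can be illustrated with a minimal sketch. It is not the
authors' implementation. It assumes a Bradley-Terry style preference predictor
built on summed per-step rewards, and the names `preference_prob`,
`pseudo_label`, `temporal_crop`, the `threshold` value, and the independently
drawn crop offsets are all hypothetical choices made for illustration.

```python
import numpy as np

def preference_prob(reward_fn, seg0, seg1):
    """Probability that seg1 is preferred over seg0.

    Assumes a Bradley-Terry style model on summed predicted rewards;
    reward_fn maps a (T, d) segment to a (T,) array of per-step rewards.
    """
    diff = reward_fn(seg1).sum() - reward_fn(seg0).sum()
    return 1.0 / (1.0 + np.exp(-diff))

def pseudo_label(reward_fn, seg0, seg1, threshold=0.99):
    """Return 0 or 1 when the predictor is confident, otherwise None.

    The threshold is an assumed hyperparameter, not a value from the source.
    """
    p1 = preference_prob(reward_fn, seg0, seg1)
    confidence = max(p1, 1.0 - p1)
    if confidence < threshold:
        return None  # too uncertain: leave this pair unlabeled
    return int(p1 > 0.5)

def temporal_crop(seg0, seg1, crop_len, rng):
    """Crop a consecutive subsequence of length crop_len from each segment.

    Start offsets are drawn independently here, which is an assumption.
    """
    t0 = rng.integers(0, len(seg0) - crop_len + 1)
    t1 = rng.integers(0, len(seg1) - crop_len + 1)
    return seg0[t0:t0 + crop_len], seg1[t1:t1 + crop_len]

# Toy usage: the reward is the first feature of each step.
rng = np.random.default_rng(0)
toy_reward = lambda seg: seg[:, 0]
low, high = np.zeros((10, 2)), np.ones((10, 2))
label = pseudo_label(toy_reward, low, high)      # confident pair gets a label
crop_a, crop_b = temporal_crop(low, high, 6, rng)
```

In a training loop of this shape, pairs that receive a pseudo-label would be
added to the labeled set, and cropped pairs would inherit the label of the
pair they were cropped from.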

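This paper proposes SURF, a semi-supervised reward learning framework that uses a large number of unlabeled samples together with data augmentation. Experiments show that the method significantly improves the feedback efficiency of the state-of-the-art preference-based method on a variety of locomotion and robotic manipulation tasks.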

SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

In preference-based reinforcement learning (RL), an agent interacts with the
environment while receiving preferences instead of absolute feedback. While
there is increasing research activity in preference-based RL, the design of
formal frameworks that admit tractable theoretical analysis remains an open
challenge. Building upon ideas from preference-based bandit learning and
posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which
employs preference-based posterior sampling to learn both the system dynamics
and the underlying utility function that governs the preference feedback. As
preference feedback is provided on trajectories rather than individual
state-action pairs, we develop a Bayesian approach for the credit assignment
problem, translating preferences to a posterior distribution over state-action
reward models. We prove an asymptotic Bayesian no-regret rate for DPS with a
Bayesian linear regression credit assignment model. This is the first regret
guarantee for preference-based RL to our knowledge. We also discuss possible
avenues for extending the proof methodology to other credit assignment models.
Finally, we evaluate the approach empirically, showing competitive performance
against existing baselines.
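
To make the credit assignment idea concrete, the following is a minimal
sketch of Bayesian linear regression over trajectory-level preferences,
followed by one posterior sample of the reward vector. It is an illustration
under stated assumptions, not the paper's algorithm: each row of `X` is taken
to be the difference between the state-action visit counts of two
trajectories, labels are encoded as +0.5 or -0.5, the observation noise
variance is fixed at 1, and the names `blr_posterior`, `sample_reward`, and
`prior_precision` are hypothetical.

```python
import numpy as np

def blr_posterior(X, y, prior_precision=1.0):
    """Gaussian posterior over per-state-action rewards.

    X: (n, d) feature differences between the two trajectories of each query.
    y: (n,) labels, +0.5 if the first trajectory was preferred, else -0.5
       (an assumed encoding).
    Assumes a zero-mean Gaussian prior and unit noise variance.
    """
    d = X.shape[1]
    precision = X.T @ X + prior_precision * np.eye(d)
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y
    return mean, cov

def sample_reward(mean, cov, rng):
    """Draw one reward vector from the posterior (posterior sampling step)."""
    return rng.multivariate_normal(mean, cov)

# Toy usage: two state-action pairs; the first trajectory visits pair 0,
# the second visits pair 1, and the first is preferred three times.
X = np.array([[1.0, -1.0]] * 3)
y = np.array([0.5, 0.5, 0.5])
mean, cov = blr_posterior(X, y)
reward_sample = sample_reward(mean, cov, np.random.default_rng(0))
```

The posterior mean assigns a higher reward to the state-action pair visited
by the preferred trajectory, which is the trajectory-to-state-action credit
assignment the abstract describes.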

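Using preference-based posterior sampling and a Bayesian approach, this work addresses the credit assignment problem in preference-based reinforcement learning. It proposes a new algorithm, DUELING POSTERIOR SAMPLING (DPS), and gives what is, to the authors' knowledge, the first regret guarantee for preference-based RL.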