Preference-based reinforcement learning (RL) algorithms help avoid the
pitfalls of hand-crafted reward functions by distilling them from human
preference feedback, but they remain impractical due to the burdensome number
of labels required from the human, even for relatively simple tasks. In this
work, we demonstrate that encoding environment dynamics in the reward function
(REED) dramatically reduces the number of preference labels required in
state-of-the-art preference-based RL frameworks. We hypothesize that REED-based
methods better partition the state-action space and facilitate generalization
to state-action pairs not included in the preference dataset. REED iterates
between encoding environment dynamics in a state-action representation via a
self-supervised temporal consistency task, and bootstrapping the
preference-based reward function from the state-action representation. Whereas
prior approaches train only on the preference-labelled trajectory pairs, REED
exposes the state-action representation to all transitions experienced during
policy training. We explore the benefits of REED within the PrefPPO [1] and
PEBBLE [2] preference learning frameworks and demonstrate improvements across
experimental conditions to both the speed of policy learning and the final
policy performance. For example, on quadruped-walk and walker-walk with 50
preference labels, REED-based reward functions recover 83% and 66% of ground
truth reward policy performance, whereas without REED only 38% and 21% are
recovered. For some domains, REED-based reward functions result in policies
that outperform policies trained on the ground truth reward.
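
The two objectives that REED alternates between can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a Bradley-Terry preference model over summed predicted rewards (a common choice in preference-based RL) and a negative cosine similarity as the temporal consistency objective. The function names `preference_loss` and `temporal_consistency_loss` are hypothetical.

```python
import numpy as np


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss under an assumed Bradley-Terry preference model.

    rewards_a, rewards_b: per-step predicted rewards for two trajectory segments.
    label: 1.0 if segment A is preferred, 0.0 if segment B is preferred.
    """
    # The preference logit is the difference of the summed predicted rewards.
    logit = np.sum(rewards_a) - np.sum(rewards_b)
    # log sigmoid(logit) and log sigmoid(-logit), computed stably.
    log_p_a = -np.logaddexp(0.0, -logit)
    log_p_b = -np.logaddexp(0.0, logit)
    return float(-(label * log_p_a + (1.0 - label) * log_p_b))


def temporal_consistency_loss(predicted_next, target_next):
    """Negative cosine similarity between predicted and observed next-state embeddings.

    Both arrays have shape (batch, embedding_dim). This loss form is an
    assumption made for illustration; other consistency objectives are possible.
    """
    dot = np.sum(predicted_next * target_next, axis=-1)
    norms = np.linalg.norm(predicted_next, axis=-1) * np.linalg.norm(target_next, axis=-1)
    return float(np.mean(-dot / (norms + 1e-8)))
```

In a sketch like this, the consistency loss could be evaluated on every transition gathered during policy training, while the preference loss only sees the labelled trajectory pairs, which mirrors the contrast drawn in the abstract.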

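This work uses REED, a method that encodes environment dynamics in the reward function, to reduce the number of preference labels that must be collected from humans in preference-based RL frameworks, which in turn improves both the speed of policy learning and the final policy performance.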

Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning

Since convolutional neural networks (ConvNets) can easily memorize noisy
labels, which are ubiquitous in visual classification tasks, it has been a
great challenge to train ConvNets against them robustly. Various solutions,
e.g., sample selection, label correction, and robustifying loss functions, have
been proposed for this challenge, and most of them stick to the end-to-end
training of the representation (feature extractor) and classifier. In this
paper, by carefully re-examining the learning behaviors of
the representation and classifier, we discover that the representation is much
more fragile in the presence of noisy labels than the classifier. Thus, we are
motivated to design a new method, i.e., REED, that leverages these discoveries to
learn from noisy labels robustly. The proposed method contains three stages,
i.e., obtaining the representation by self-supervised learning without any
labels, transforming the noisy label learning problem into a semi-supervised one
using a classifier trained directly and reliably with noisy labels, and joint
semi-supervised retraining of both the representation and classifier. Extensive
experiments are performed on both synthetic and real benchmark datasets.
Results demonstrate that the proposed method outperforms state-of-the-art methods
by a large margin, especially at high noise levels.
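
The second stage, which turns noisy-label learning into a semi-supervised problem, might look something like the sketch below. The abstract does not say how samples are divided, so the rule used here is purely an assumption for illustration: a sample keeps its label only when the classifier agrees with the noisy label with high confidence, and otherwise it is treated as unlabeled. The name `split_by_agreement` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def split_by_agreement(probs, noisy_labels, threshold=0.9):
    """Split sample indices into a trusted labeled set and an unlabeled set.

    probs: (n_samples, n_classes) class probabilities from the classifier.
    noisy_labels: (n_samples,) integer labels that may be wrong.
    threshold: hypothetical confidence cutoff for trusting a label.
    """
    predicted = np.argmax(probs, axis=1)
    confidence = np.max(probs, axis=1)
    # Trust a label only if the classifier agrees with it and is confident.
    trusted = (predicted == noisy_labels) & (confidence >= threshold)
    labeled_idx = np.flatnonzero(trusted)
    unlabeled_idx = np.flatnonzero(~trusted)
    return labeled_idx, unlabeled_idx
```

The resulting labeled and unlabeled index sets could then feed any standard semi-supervised training procedure in the third stage.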

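This paper proposes a new method, REED, to address the challenge of training convolutional neural networks in the presence of noisy labels. The method obtains the representation through self-supervised learning without labels, recasts noisy-label learning as a semi-supervised problem using a classifier trained on the noisy labels, and jointly retrains the representation and classifier in a semi-supervised manner, making training robust to noisy labels. Extensive experiments show that it significantly outperforms existing state-of-the-art methods at high noise levels.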