Offline reinforcement learning has become one of the most practical RL
settings. A recent success story is RLHF, a form of offline preference-based RL
(PBRL) with preferences from humans. However, most existing work on offline RL
focus on the standard setting with scalar reward feedback. It remains unknown
how to universally transfer the existing rich understanding of offline RL from
the reward-based to the preference-based setting. In this work, we propose a
general framework to bridge this gap. Our key insight is to transform
preference feedback into scalar rewards via optimal reward labeling (ORL), after
which any reward-based offline RL algorithm can be applied to the reward-labeled
dataset. We theoretically show the connection between several recent
PBRL techniques and our framework combined with specific offline RL algorithms
in terms of how they utilize the preference signals. By combining reward
labeling with different algorithms, our framework can lead to new and
potentially more efficient offline PBRL algorithms. We empirically test our
framework on preference datasets based on the standard D4RL benchmark. When
combined with a variety of efficient reward-based offline RL algorithms, the
performance achieved under our framework is comparable to that of training the
same algorithm on the dataset with the actual rewards in many cases, and better
than recent PBRL baselines in most cases.
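
To make the "label, then reuse any offline RL algorithm" idea concrete, here is a minimal sketch of one possible labeling step. It is not the paper's ORL procedure, which the abstract does not specify. It assumes a linear reward over per-step features and fits it to pairwise segment preferences with a Bradley-Terry likelihood, a common choice in preference-based RL. All function names, the feature representation, and the synthetic data are hypothetical and for illustration only.

```python
import numpy as np


def fit_linear_reward(seg_a, seg_b, prefs, lr=0.1, steps=500):
    """Fit w so that r(x) = w @ x explains the preferences (Bradley-Terry model).

    seg_a, seg_b: arrays of shape (n_pairs, horizon, feat_dim) with per-step features.
    prefs: array of shape (n_pairs,), 1.0 if seg_a is preferred, 0.0 if seg_b is.
    """
    # With a linear reward, a segment's return depends only on its summed features.
    diff = seg_a.sum(axis=1) - seg_b.sum(axis=1)  # (n_pairs, feat_dim)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        # Numerically stable sigmoid: P(seg_a preferred) under the current w.
        p_a = 0.5 * (1.0 + np.tanh(0.5 * (diff @ w)))
        # Gradient ascent on the mean log-likelihood of the observed preferences.
        w += lr * diff.T @ (prefs - p_a) / len(prefs)
    return w


def label_rewards(features, w):
    """Attach a scalar reward to every transition; features has shape (n, feat_dim)."""
    return features @ w


def make_synthetic_pairs(n_pairs=200, horizon=10, feat_dim=3, seed=0):
    """Hypothetical data: preferences follow the return under a hidden linear reward."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=feat_dim)
    seg_a = rng.normal(size=(n_pairs, horizon, feat_dim))
    seg_b = rng.normal(size=(n_pairs, horizon, feat_dim))
    ret_a = (seg_a @ true_w).sum(axis=1)
    ret_b = (seg_b @ true_w).sum(axis=1)
    prefs = (ret_a > ret_b).astype(float)
    return seg_a, seg_b, prefs, true_w


if __name__ == "__main__":
    seg_a, seg_b, prefs, _ = make_synthetic_pairs()
    w = fit_linear_reward(seg_a, seg_b, prefs)
    # Flatten segments into transitions and label each one with a scalar reward.
    rewards = label_rewards(seg_a.reshape(-1, seg_a.shape[-1]), w)
    print(rewards.shape)
```

Once every transition carries a scalar label, the dataset has the same form as a standard reward-based offline RL dataset, so an off-the-shelf offline RL algorithm could in principle be trained on it unchanged.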

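We propose a general framework that connects preference feedback to scalar rewards, so that existing offline RL algorithms can be adapted to preference feedback. Experiments show that the framework, combined with different algorithms, achieves learning results comparable to training on the actual rewards and often better than offline PBRL algorithms.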

Optimal Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning

The recent paper "Reward is Enough" by Silver, Singh, Precup and Sutton
posits that the concept of reward maximisation is sufficient to underpin all
intelligence, both natural and artificial. We contest the underlying assumption
of Silver et al. that such reward can be scalar-valued. In this paper we
explain why scalar rewards are insufficient to account for some aspects of both
biological and computational intelligence, and argue in favour of explicitly
multi-objective models of reward maximisation. Furthermore, we contend that
even if scalar reward functions can trigger intelligent behaviour in specific
cases, it is still undesirable to use this approach for the development of
artificial general intelligence due to unacceptable risks of unsafe or
unethical behaviour.

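The paper "Reward is Enough" posits that reward maximisation underlies all intelligence. We argue that scalar rewards cannot account for some aspects of biological and computational intelligence, so explicitly multi-objective models of reward should be adopted. Even if scalar rewards can trigger intelligent behaviour, this approach should be avoided for developing artificial general intelligence because of the risk of unsafe or unethical behaviour.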