Deep Reinforcement Learning is widely used for aligning Large Language Models
(LLMs) with human preferences. However, conventional reward modelling has
predominantly depended on human annotations provided by a select cohort of
individuals. Such dependence may unintentionally result in models that are
skewed to reflect the inclinations of these annotators, thereby failing to
represent the expectations of the wider population adequately. In this paper,
we introduce the Distributional Preference Reward Model (DPRM), a simple yet
effective framework to align large language models with a diverse set of human
preferences. To this end, we characterize the preferences by a beta
distribution, which can dynamically adapt to fluctuations in preference trends.
On top of that, we design an optimal-transport-based loss to calibrate
DPRM to align with the preference distribution. Finally, the expected reward is
utilized to fine-tune an LLM policy to generate responses favoured by the
population. Our experiments show that DPRM significantly enhances the alignment
of LLMs with population preference, yielding more accurate, unbiased, and
contextually appropriate responses.
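
The three ingredients named in the abstract (a beta-distributed preference, an optimal-transport loss, and an expected reward) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the five ordered preference bins, the conjugate vote-count update, and the unit ground cost between neighbouring bins.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def discretise_beta(a, b, num_bins=5, steps=200):
    """Approximate the Beta(a, b) mass in equal-width bins (midpoint rule)."""
    width = 1.0 / (num_bins * steps)
    masses = []
    for k in range(num_bins):
        lo = k / num_bins
        area = sum(beta_pdf(lo + (i + 0.5) * width, a, b) for i in range(steps))
        masses.append(area * width)
    total = sum(masses)
    return [m / total for m in masses]

def update_beta(a, b, likes, dislikes):
    """Assumed conjugate update: new votes shift the preference distribution."""
    return a + likes, b + dislikes

def ot_loss(pred, target):
    """Wasserstein-1 cost between two distributions over ordered bins.

    With unit distance between neighbouring bins, the optimal transport
    cost in one dimension is the sum of absolute CDF differences.
    """
    cost, cdf_gap = 0.0, 0.0
    for p, t in zip(pred, target):
        cdf_gap += p - t
        cost += abs(cdf_gap)
    return cost

def expected_reward(pred, bin_rewards):
    """Scalar reward for policy fine-tuning: mean of the predicted distribution."""
    return sum(p * r for p, r in zip(pred, bin_rewards))
```

In this sketch, a reward model's predicted bin probabilities would be compared with `discretise_beta(a, b)` through `ot_loss`, and `expected_reward` would supply the scalar signal for policy optimisation. Unlike cross-entropy, the transport cost penalises mass placed far from the target bin more than mass placed in a neighbouring bin.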

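The Distributional Preference Reward Model (DPRM) is a simple yet effective framework that aligns large language models (LLMs) with a diverse set of human preferences, so that they better represent the preferences of the population.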

Aligning Crowd Feedback via Distributional Preference Reward Modeling

We generalise the problem of reward modelling (RM) for reinforcement learning
(RL) to handle non-Markovian rewards. Existing work assumes that human
evaluators observe each step in a trajectory independently when providing
feedback on agent behaviour. In this work, we remove this assumption, extending
RM to capture temporal dependencies in human assessment of trajectories. We
show how RM can be approached as a multiple instance learning (MIL) problem,
where trajectories are treated as bags with return labels, and steps within the
trajectories are instances with unseen reward labels. We go on to develop new
MIL models that are able to capture the time dependencies in labelled
trajectories. We demonstrate on a range of RL tasks that our novel MIL models
can reconstruct reward functions to a high level of accuracy, and can be used
to train high-performing agent policies.
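
The bag-and-instance framing can be made concrete with a toy sketch. It is not the paper's model: the three-symbol environment, the hand-written hidden state `has_key`, the tabular reward model, and the learning rate are all assumptions chosen to keep the example small. A learned model would have to discover the hidden state itself; here it is hard-coded so that only the per-step rewards are learned, using nothing but trajectory returns.

```python
import itertools

OBS = ("empty", "key", "door")

def step_state(has_key, obs):
    """Hand-written hidden-state update (assumption): remember if a key was seen."""
    return has_key or obs == "key"

def true_reward(has_key, obs):
    """Non-Markovian ground truth: the door pays off only after an earlier key."""
    return 1.0 if (obs == "door" and has_key) else 0.0

def rollout(traj, reward_fn):
    """Per-step (instance) rewards that reward_fn assigns along a trajectory."""
    has_key, rewards = False, []
    for obs in traj:
        rewards.append(reward_fn(has_key, obs))
        has_key = step_state(has_key, obs)
    return rewards

# Bags: every length-3 trajectory, labelled only with its return.
bags = [(traj, sum(rollout(traj, true_reward)))
        for traj in itertools.product(OBS, repeat=3)]

# Instance-level model: one learnable reward per (hidden state, observation).
weights = {(h, o): 0.0 for h in (False, True) for o in OBS}

def predicted_reward(has_key, obs):
    return weights[(has_key, obs)]

LEARNING_RATE = 0.05
for _ in range(2000):
    for traj, label in bags:
        # Bag-level error: predicted return (sum of instances) minus the label.
        error = sum(rollout(traj, predicted_reward)) - label
        # Gradient step on 0.5 * error**2; each visited pair gets one update.
        has_key = False
        for obs in traj:
            weights[(has_key, obs)] -= LEARNING_RATE * error
            has_key = step_state(has_key, obs)
```

After training, `weights[(True, "door")]` should approach 1 and every other entry should approach 0, even though no step-level reward was ever shown to the model. The same observation ("door") receives different rewards depending on the history, which is the temporal dependency that a step-independent reward model cannot represent.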

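In this paper, we generalise reward modelling to reinforcement learning problems with non-Markovian rewards. We remove the assumption made in existing work that evaluators observe each step independently when giving feedback, and extend reward modelling to capture temporal dependencies in human assessment of trajectories. We treat this as a multiple instance learning (MIL) problem in which trajectories are bags with return labels and the steps within them are instances with unseen reward labels. We also develop new MIL models that capture the time dependencies in labelled trajectories, and show on a range of RL tasks that these models can reconstruct reward functions to a high level of accuracy and can be used to train high-performing agent policies.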

Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning

We present an algorithm for Inverse Reinforcement Learning (IRL) from expert
state observations only. Our approach decouples reward modelling from policy
learning, unlike state-of-the-art adversarial methods which require updating
the reward model during policy search and are known to be unstable and
difficult to optimize. Our method, IL-flOw, recovers the expert policy by
modelling state-to-state transitions and generating rewards using deep density
estimators trained on the demonstration trajectories, avoiding the instability
issues of adversarial methods. We demonstrate that using the state transition
log-probability density as a reward signal for forward reinforcement learning
translates to matching the trajectory distribution of the expert
demonstrations, and experimentally show good recovery of the true reward signal
as well as state-of-the-art results for imitation from observation on
locomotion and robotic continuous control tasks.
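
The idea of using a transition log-density as a fixed reward can be sketched as follows. This example deliberately swaps the deep density estimator for a one-dimensional Gaussian fitted to state differences, so it is a stand-in for the technique and not the IL-flOw implementation. The demonstration data and all function names are hypothetical.

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance (variance floored for stability)."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, max(var, 1e-6)

def log_density(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical expert demonstration: states only, no actions.
expert_states = [0.0, 1.0, 2.1, 3.0, 4.1, 5.0]
deltas = [nxt - cur for cur, nxt in zip(expert_states, expert_states[1:])]

# Stage 1: fit the density model once, before any policy learning.
mean, var = fit_gaussian(deltas)

def transition_reward(state, next_state):
    """Stage 2: reward for forward RL is the log-density of the transition."""
    return log_density(next_state - state, mean, var)
```

Because the density model is fitted once and then frozen, the reward does not change during policy search, which is the decoupling the abstract contrasts with adversarial methods. In this toy setting, `transition_reward(2.0, 3.0)` is higher than `transition_reward(2.0, 2.0)`, since the expert advances by roughly one unit per step.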

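This paper presents IL-flOw, an inverse reinforcement learning algorithm that learns from expert state observations only. It decouples reward modelling from policy learning and generates reward signals with deep density estimators, avoiding the instability of adversarial training methods. Using the state-transition log-probability density as the reward signal for forward reinforcement learning, the experiments show strong performance on robotic control tasks.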