Reward functions are a common way to specify the objective of a robot. As
designing reward functions can be extremely challenging, a more promising
approach is to directly learn reward functions from human teachers.
Importantly, data from human teachers can be collected either passively or
actively in a variety of forms: passive data sources include demonstrations
(e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings)
are actively elicited. Prior research has independently applied reward learning
to these different data sources. However, there exist many domains where
multiple sources are complementary and expressive. Motivated by this general
problem, we present a framework to integrate multiple sources of information,
which are either passively or actively collected from human users. In
particular, we present an algorithm that first utilizes user demonstrations to
initialize a belief about the reward function, and then actively probes the
user with preference queries to zero in on their true reward. This algorithm
not only enables us to combine multiple data sources, but it also informs the
robot when it should leverage each type of information. Further, our approach
accounts for the human's ability to provide data, yielding user-friendly
preference queries which are also theoretically optimal. Our extensive
simulated experiments and user studies on a Fetch mobile manipulator
demonstrate the superiority and the usability of our integrated framework.
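
The two-stage idea described above (demonstrations first, preference queries second) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: rewards are linear in hypothetical trajectory features, the belief is a set of weighted samples, the demonstration likelihood is a Boltzmann-style model with its normalizing constant ignored, the preference model is logistic, and queries are chosen by information gain as one plausible notion of an "optimal" query. The simulated user always answers according to the true weights.

```python
import itertools
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, safe at p = 0 or 1."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def prefer_a_prob(weights, feat_a, feat_b):
    """Assumed logistic model: P(user prefers A over B) per weight sample."""
    return 1.0 / (1.0 + np.exp(-(weights @ (feat_a - feat_b))))

def init_belief_from_demos(weights, demo_feats, beta=5.0):
    """Passive stage: weight each sample by exp(beta * reward of the demos).
    Simplification: the per-sample normalizing constant is ignored."""
    log_b = beta * (weights @ demo_feats.sum(axis=0))
    b = np.exp(log_b - log_b.max())
    return b / b.sum()

def information_gain(weights, belief, feat_a, feat_b):
    """Mutual information between the user's answer and the reward weights."""
    p_a = prefer_a_prob(weights, feat_a, feat_b)
    return binary_entropy(belief @ p_a) - belief @ binary_entropy(p_a)

def select_query(weights, belief, candidates):
    """Active stage: pick the candidate pair with the highest information gain."""
    pairs = itertools.combinations(range(len(candidates)), 2)
    return max(pairs, key=lambda ij: information_gain(
        weights, belief, candidates[ij[0]], candidates[ij[1]]))

def update_belief(weights, belief, feat_a, feat_b, chose_a):
    """Bayesian update of the sample weights after one preference answer."""
    p_a = prefer_a_prob(weights, feat_a, feat_b)
    posterior = belief * (p_a if chose_a else 1.0 - p_a)
    return posterior / posterior.sum()

def run_sketch(n_samples=1000, dim=3, n_queries=5, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical ground-truth reward weights on the unit sphere.
    true_w = rng.normal(size=dim)
    true_w /= np.linalg.norm(true_w)
    # Candidate reward weights, also on the unit sphere.
    weights = rng.normal(size=(n_samples, dim))
    weights /= np.linalg.norm(weights, axis=1, keepdims=True)
    # Two noisy demonstrations whose features roughly follow the true weights.
    demo_feats = true_w + 0.1 * rng.normal(size=(2, dim))
    belief = init_belief_from_demos(weights, demo_feats)
    # Feature vectors of hypothetical candidate trajectories to compare.
    candidates = rng.normal(size=(15, dim))
    for _ in range(n_queries):
        i, j = select_query(weights, belief, candidates)
        chose_a = true_w @ (candidates[i] - candidates[j]) > 0
        belief = update_belief(weights, belief, candidates[i], candidates[j], chose_a)
    return weights, belief, true_w
```

Calling `run_sketch()` returns the weight samples, the final belief over them, and the simulated true weights; the belief-weighted mean `belief @ weights` then serves as a point estimate of the reward. A sample-based belief keeps the update to a single reweighting step, at the cost of being limited to the initially drawn samples.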

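This paper proposes a framework for collecting data from users through multiple sources, combining demonstrations and preference queries to learn reward functions that can be used in robot models. Simulated experiments and user studies on a Fetch mobile manipulator validate the superiority and usability of the approach.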

Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences

Training deep reinforcement learning agents to perform complex behaviors in 3D virtual
environments requires significant computational resources. This is especially
true in environments with high degrees of aliasing, where many states share
nearly identical visual features. Minecraft is an exemplar of such an
environment. We hypothesize that interactive machine learning (IML), wherein
human teachers play a direct role in training through demonstrations, critique,
or action advice, may alleviate agent susceptibility to aliasing. However,
interactive machine learning is only practical when the number of human
interactions is limited, requiring a balance between human teacher effort and
agent performance. We conduct experiments with two reinforcement learning
algorithms which enable human teachers to give action advice, Feedback
Arbitration and Newtonian Action Advice, under visual aliasing conditions. To
assess potential cognitive load per advice type, we vary the accuracy and
frequency of various human action advice techniques. Training efficiency,
robustness against infrequent and inaccurate advisor input, and sensitivity to
aliasing are examined.
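
As a rough illustration of what varying the accuracy and frequency of action advice could look like in simulation, the sketch below defines a hypothetical advisor. It is not an implementation of Feedback Arbitration or Newtonian Action Advice; the function names, the "always follow advice when given" rule, and the action set are assumptions introduced here.

```python
import random

def simulated_advice(optimal_action, actions, accuracy, frequency, rng):
    """Return an advised action, or None when the advisor stays silent.

    frequency: probability that the advisor speaks at this step.
    accuracy:  probability that given advice equals the optimal action.
    """
    if rng.random() >= frequency:
        return None  # advisor stays silent this step
    if rng.random() < accuracy:
        return optimal_action
    # Inaccurate advice: pick uniformly among the non-optimal actions.
    wrong = [a for a in actions if a != optimal_action]
    return rng.choice(wrong)

def choose_action(agent_action, advice):
    """Hypothetical arbitration rule: follow advice whenever it is given."""
    return agent_action if advice is None else advice

# Usage: an advisor that speaks about 30% of the time and is right about 80% of the time.
rng = random.Random(0)
actions = ["forward", "left", "right", "jump"]
advice = simulated_advice("forward", actions, accuracy=0.8, frequency=0.3, rng=rng)
action = choose_action("left", advice)
```

Sweeping `accuracy` and `frequency` over a grid would then give one way to probe robustness to infrequent or inaccurate advisor input.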

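Interactive machine learning can help train deep reinforcement learning agents with complex behaviors, but it requires a balance between human teacher effort and agent performance. This study examines two reinforcement learning algorithms under visual aliasing conditions, using human action advice to improve agent performance, assessing the potential cognitive load of each advice type, and evaluating training efficiency and robustness to inaccurate advice.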