Many decision-making problems feature multiple objectives. In such problems,
it is not always possible to know the preferences of a decision-maker for
different objectives. However, it is often possible to observe the behavior of
decision-makers. In multi-objective decision-making, preference inference is
the process of inferring the preferences of a decision-maker for different
objectives. This research proposes a Dynamic Weight-based Preference Inference
(DWPI) algorithm that can infer the preferences of agents acting in
multi-objective decision-making problems, based on observed behavior
trajectories in the environment. The proposed method is evaluated on three
multi-objective Markov decision processes: Deep Sea Treasure, Traffic, and Item
Gathering. The performance of the proposed DWPI approach is compared to two
existing preference inference methods from the literature, and empirical
results demonstrate significant improvements compared to the baseline
algorithms, in terms of both time requirements and accuracy of the inferred
preferences. The Dynamic Weight-based Preference Inference algorithm also
maintains its performance when inferring preferences for sub-optimal behavior
demonstrations. In addition to its impressive performance, the Dynamic
Weight-based Preference Inference algorithm does not require any interactions
during training with the agent whose preferences are inferred, all that is
required is a trajectory of observed behavior.

该研究提出了一种基于动态权重的偏好推断算法，通过观察环境中的行为轨迹，能够推断多目标决策问题中代理人的偏好，实验结果表明其相较于现有方法能够显著提高推断效率和准确性。

多目标强化学习中基于动态权重的演示偏好推断方法

Inferring Preferences from Demonstrations in Multi-objective  Reinforcement Learning: A Dynamic Weight-based Approach

An important use of machine learning is to learn what people value. What
posts or photos should a user be shown? Which jobs or activities would a person
find rewarding? In each case, observations of people's past choices can inform
our inferences about their likes and preferences. If we assume that choices are
approximately optimal according to some utility function, we can treat
preference inference as Bayesian inverse planning. That is, given a prior on
utility functions and some observed choices, we invert an optimal
decision-making process to infer a posterior distribution on utility functions.
However, people often deviate from approximate optimality. They have false
beliefs, their planning is sub-optimal, and their choices may be temporally
inconsistent due to hyperbolic discounting and other biases. We demonstrate how
to incorporate these deviations into algorithms for preference inference by
constructing generative models of planning for agents who are subject to false
beliefs and time inconsistency. We explore the inferences these models make
about preferences, beliefs, and biases. We present a behavioral experiment in
which human subjects perform preference inference given the same observations
of choices as our model. Results show that human subjects (like our model)
explain choices in terms of systematic deviations from optimal behavior and
suggest that they take such deviations into account when inferring preferences.

研究机器学习中先前观察到的人们的选择，作为贝叶斯反向规划的先验，建议了一种引入计划偏差和时序不一致性的算法，通过构造计划生成模型，分析了其对偏差和忠诚度的推断。人体实验表明，人们也会从系统性偏离最佳行为中解释选择，并考虑这些偏差。