Specifying a Markov decision process (MDP) can be difficult. Reward function specification is especially problematic; in practice, it is often cognitively complex and time-consuming for users to specify rewards precisely. This work casts the problem of specifying rewards as one of preference elicitation and aims to minimize the degree of precision with which a reward function must be specified while still allowing optimal or near-optimal policies to be produced. We first discuss how robust policies can be computed for MDPs given only partial reward information using the minimax regret criterion. We then demonstrate how regret can be reduced by efficiently eliciting reward information using bound queries, with regret reduction serving as the means for choosing suitable queries. Empirical results demonstrate that regret-based reward elicitation offers an effective way to produce near-optimal policies without resorting to the precise specification of the entire reward function.
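As a rough illustration of the minimax regret criterion, the following sketch picks the policy whose worst-case loss, relative to the best policy for each feasible reward function, is smallest. It is a toy under stated assumptions, not the method of this work: the reward uncertainty is reduced to a finite list of candidate reward functions, the policy set is enumerated by brute force, and the table `values`, the helper names, and all numbers are hypothetical. The response to a bound query is modeled simply as ruling out some candidates.

```python
def max_regret(values, p):
    """Worst-case loss of policy p against the best policy for each reward candidate.

    values[q][r] is the (hypothetical) expected value of policy q under candidate reward r.
    """
    n_rewards = len(values[p])
    return max(
        max(row[r] for row in values) - values[p][r]
        for r in range(n_rewards)
    )


def minimax_regret_policy(values):
    """Return (policy index, max regret) for the policy minimizing max regret."""
    best = min(range(len(values)), key=lambda p: max_regret(values, p))
    return best, max_regret(values, best)


def restrict(values, keep):
    """Keep only the reward candidates still feasible after a query response."""
    return [[row[r] for r in keep] for row in values]


# Hypothetical value table: 3 policies (rows) x 3 candidate reward functions (columns).
values = [
    [10.0, 2.0, 6.0],  # policy 0
    [4.0, 9.0, 5.0],   # policy 1
    [7.0, 6.0, 6.0],   # policy 2
]

before = minimax_regret_policy(values)

# Assumption: the answer to a bound query rules out candidate reward 1.
after = minimax_regret_policy(restrict(values, [0, 2]))
```

With these numbers, the robust choice before any query is policy 2, with max regret 3.0. Once the query response eliminates candidate 1, policy 0 becomes optimal under every remaining candidate and max regret drops to 0.0, which is the sense in which a query can be scored by the regret reduction it yields.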

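This paper treats the problem of reward function specification as one of preference elicitation, and aims to minimize the precision with which the reward function must be specified while still allowing optimal or near-optimal policies to be produced. Robust policies for MDPs with only partial reward information are computed using the minimax regret criterion; the paper then demonstrates how bound queries can be used to elicit reward information efficiently and so reduce regret, with regret reduction serving as the means of choosing suitable queries. Empirical results show that regret-based reward elicitation provides an effective way to produce near-optimal policies without requiring precise specification of the entire reward function.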

Regret-based Reward Elicitation for Markov Decision Processes