Learning from expert demonstrations has received a lot of attention in artificial intelligence and machine learning. The goal is to infer the underlying reward function that an agent is optimizing, given a set of observations of the agent's behavior over time in a variety of circumstances, the system state trajectories, and a plant model specifying the evolution of the system state under the agent's different actions. The system is often modeled as a Markov decision process; that is, the next state depends only on the current state and the agent's action, and the agent's choice of action depends only on the current state. While the former is a Markovian assumption on the evolution of the system state, the latter assumes that the target reward function is itself Markovian. In this work, we explore learning a class of non-Markovian reward functions, known in the formal methods literature as specifications. These specifications offer better composition, transferability, and interpretability. We then show that inferring the specification can be done efficiently without unrolling the transition system. We demonstrate the approach on a 2-d grid world example.
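
To make the Markovian versus non-Markovian distinction concrete, the following is a minimal sketch, not taken from the paper, of a specification written as a Boolean property of a whole trajectory in a small grid world. The cell coordinates, the labels `LAVA`, `RECHARGE`, and `GOAL`, and the task itself are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Cell = Tuple[int, int]
Trace = List[Cell]
Spec = Callable[[Trace], bool]

# Hypothetical labelled cells in a small 2-d grid world (illustration only).
LAVA = {(1, 1)}
RECHARGE = {(0, 2)}
GOAL = {(2, 2)}


def avoid_lava(trace: Trace) -> bool:
    """Safety property: no state in the trace is a lava cell."""
    return all(cell not in LAVA for cell in trace)


def recharge_before_goal(trace: Trace) -> bool:
    """Ordering property: the goal is reached, and a recharge cell
    is visited before the goal is first reached."""
    recharged = False
    for cell in trace:
        if cell in RECHARGE:
            recharged = True
        if cell in GOAL:
            return recharged
    return False  # the goal was never reached


def conjunction(*specs: Spec) -> Spec:
    """Compose specifications: the result holds iff every part holds."""
    return lambda trace: all(spec(trace) for spec in specs)


task = conjunction(avoid_lava, recharge_before_goal)

# Both traces end in the same goal cell, but only the first one recharges first.
good = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
bad = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]

print(task(good), task(bad))
```

Both example traces end in the same cell, yet only one satisfies `task`, so a reward that looks only at the current cell cannot separate them. The `conjunction` helper is one simple way to picture the composition property mentioned above: combining two specifications is a logical operation on trace properties rather than a weighted sum of rewards.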

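This paper proposes a method for learning non-Markovian rewards from robot demonstrations. It poses the problem as maximum a posteriori inference, derives a likelihood model for the demonstrations using the principle of maximum entropy, and uses an efficient method to search a large pool of candidate specifications for the most probable one. Experiments indicate that learning specifications helps avoid common problems that often arise from ad hoc reward composition.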

Learning Task Specifications from Demonstrations