We consider the problem of learning by demonstration from agents acting in unknown stochastic Markov environments or games. Our aim is to estimate agent preferences in order to construct improved policies for the same task that the agents are trying to solve. To do so, we extend previous probabilistic approaches for inverse reinforcement learning in known MDPs to the case of unknown dynamics or opponents. We do this by deriving two simplified probabilistic models of the demonstrator's policy and utility. For tractability, we use maximum a posteriori estimation rather than full Bayesian inference. Under a flat prior, this results in a convex optimisation problem. We find that the resulting algorithms are highly competitive against a variety of other methods for inverse reinforcement learning that do have knowledge of the dynamics.

我们考虑了在未知的随机马尔可夫环境或游戏中，从代理人的示范学习的问题。我们旨在估计代理人的偏好，以构建同一任务的改进策略。为了做到这一点，我们将已知MDP中逆强化学习的概率方法扩展到未知动态或对手的情况。我们通过导出演示者策略和效用的两个简化概率模型来实现这一点，为了易于处理，我们使用了最大后验估计而不是完整的贝叶斯推断。在先验分布相同的情况下，这结果是凸优化问题。我们发现所得到的算法与其他了解动态的逆强化学习方法相比具有很高的竞争力。

未知环境下的概率逆向强化学习