This work studies discrete-time discounted Markov decision processes with
continuous state and action spaces and addresses the inverse problem of
inferring a cost function from observed optimal behavior. We first consider the
case in which we have access to the entire expert policy and characterize the
set of solutions to the inverse problem by using occupation measures, linear
duality, and complementary slackness conditions. To avoid trivial solutions and
ill-posedness, we introduce a natural linear normalization constraint. This
results in an infinite-dimensional linear feasibility problem, prompting a
thorough analysis of its properties. Next, we use linear function approximators
and adopt a randomized approach, namely the scenario approach and related
probabilistic feasibility guarantees, to derive ε-optimal solutions for
the inverse problem. We further discuss the sample complexity required for a
desired approximation accuracy. Finally, we address the more realistic case in
which only a finite set of expert demonstrations and a generative model are
available, and we bound the error incurred by working with samples.
