Markov decision processes (MDPs) are formal models commonly used in
sequential decision-making. MDPs capture the stochasticity that may arise, for
instance, from imprecise actuators via probabilities in the transition
function. However, in data-driven applications, deriving precise probabilities
from (limited) data introduces statistical errors that may lead to unexpected
or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise
probabilities but instead use so-called uncertainty sets in the transitions,
accounting for such limited data. Tools from the formal verification community
efficiently compute robust policies that provably adhere to formal
specifications, like safety constraints, under the worst-case instance in the
uncertainty set. We continuously learn the transition probabilities of an MDP
in a robust anytime-learning approach that combines a dedicated Bayesian
inference scheme with the computation of robust policies. In particular, our
method (1) approximates probabilities as intervals, (2) adapts to new data that
may be inconsistent with an intermediate model, and (3) may be stopped at any
time to compute a robust policy on the uMDP that faithfully captures the data
so far. Furthermore, our method is capable of adapting to changes in the
environment. We show the effectiveness of our approach and compare it to robust
policies computed on uMDPs learned by the UCRL2 reinforcement learning
algorithm in an experimental evaluation on several benchmarks.

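This paper presents a robust anytime-learning method that combines a Bayesian inference scheme with the computation of robust policies on uncertain Markov decision processes (uMDPs), and validates its effectiveness experimentally.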
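
To make the notion of a robust policy on an interval uMDP more concrete, the following is a minimal sketch of robust value iteration for a reachability objective. It is an illustration under stated assumptions, not the method of the paper: the data layout (`transitions[state][action]` as a list of successor/interval pairs), the function names, and the toy numbers are all hypothetical, and the intervals are assumed to be given rather than learned by the Bayesian scheme. The inner step uses the standard greedy construction of a worst-case distribution: start from the lower bounds and assign the remaining probability mass to the lowest-valued successors first.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on one transition probability
# Hypothetical layout: transitions[state][action] = [(successor, interval), ...]
Transitions = Dict[int, Dict[str, List[Tuple[int, Interval]]]]


def worst_case_distribution(intervals: List[Interval], values: List[float]) -> List[float]:
    """Return the distribution within the intervals that minimises the expected value.

    Assumes the intervals are consistent: sum of lower bounds <= 1 <= sum of upper bounds.
    """
    probs = [lo for lo, _ in intervals]
    budget = 1.0 - sum(probs)  # probability mass still to be distributed
    # Give the remaining mass to the successors with the lowest value first.
    for i in sorted(range(len(values)), key=lambda k: values[k]):
        lo, hi = intervals[i]
        extra = min(hi - lo, budget)
        probs[i] += extra
        budget -= extra
    return probs


def robust_value_iteration(
    transitions: Transitions, goal: Set[int], n_states: int, iterations: int = 100
) -> Tuple[List[float], Dict[int, str]]:
    """Maximise the worst-case probability of reaching a goal state."""
    values = [1.0 if s in goal else 0.0 for s in range(n_states)]
    policy: Dict[int, str] = {}
    for _ in range(iterations):
        new_values = list(values)
        for s, actions in transitions.items():
            if s in goal or not actions:
                continue
            best_q = float("-inf")
            for a, successors in actions.items():
                intervals = [iv for _, iv in successors]
                succ_values = [values[t] for t, _ in successors]
                probs = worst_case_distribution(intervals, succ_values)
                q = sum(p * v for p, v in zip(probs, succ_values))
                if q > best_q:
                    best_q, policy[s] = q, a
            new_values[s] = best_q
        values = new_values
    return values, policy


# Toy model: state 0 is the start, state 1 the goal, state 2 an absorbing sink.
toy: Transitions = {
    0: {
        "safe": [(1, (0.6, 0.8)), (2, (0.2, 0.4))],
        "risky": [(1, (0.3, 0.9)), (2, (0.1, 0.7))],
    }
}
values, policy = robust_value_iteration(toy, goal={1}, n_states=3)
print(values[0], policy[0])
```

In this toy model the wide interval of the `risky` action lets the worst case push most of the mass to the sink, so the robust policy prefers `safe` with a guaranteed reachability probability of about 0.6. Narrower intervals, as more data arrives, would shrink this gap.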

Robust Anytime Learning of Markov Decision Processes

The parameters for a Markov Decision Process (MDP) often cannot be specified
exactly. Uncertain MDPs (UMDPs) capture this model ambiguity by defining sets
which the parameters belong to. Minimax regret has been proposed as an
objective for planning in UMDPs to find robust policies which are not overly
conservative. In this work, we focus on planning for Stochastic Shortest Path
(SSP) UMDPs with uncertain cost and transition functions. We introduce a
Bellman equation to compute the regret for a policy. We propose a dynamic
programming algorithm that utilises the regret Bellman equation, and show that
it optimises minimax regret exactly for UMDPs with independent uncertainties.
For coupled uncertainties, we extend our approach to use options to enable a
trade-off between computation and solution quality. We evaluate our approach on
both synthetic and real-world domains, showing that it significantly
outperforms existing baselines.

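This paper introduces a Bellman equation for computing the regret of a policy and proposes a dynamic programming algorithm based on it for planning in SSP UMDPs with uncertain cost and transition functions. The algorithm optimises minimax regret exactly for UMDPs with independent uncertainties, and is extended with options to enable a trade-off between computation and solution quality. Evaluated on synthetic and real-world domains, the approach significantly outperforms existing baselines.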
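
As an illustration of the minimax regret objective itself, the sketch below computes it by brute force on a tiny SSP with two sampled model instances. It is not the regret Bellman equation or the dynamic programming algorithm described in the abstract: it simply enumerates all deterministic policies, and every name, the instance layout, and the numbers are assumptions made for the example. The regret of a policy in one instance is taken to be its expected cost minus the optimal expected cost in that instance.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

# Hypothetical layout: instance[state][action] = (cost, [(next_state, probability), ...])
Instance = Dict[str, Dict[str, Tuple[float, List[Tuple[str, float]]]]]
Policy = Dict[str, str]
GOAL = "goal"  # absorbing, cost-free goal state


def policy_cost(instance: Instance, policy: Policy, start: str, sweeps: int = 500) -> float:
    """Expected cost to reach the goal under a fixed policy (iterative evaluation).

    Assumes the policy is proper, i.e. it reaches the goal with probability 1.
    """
    v = {s: 0.0 for s in instance}
    v[GOAL] = 0.0
    for _ in range(sweeps):
        for s in instance:
            cost, successors = instance[s][policy[s]]
            v[s] = cost + sum(p * v[t] for t, p in successors)
    return v[start]


def all_policies(instance: Instance) -> Iterator[Policy]:
    """Enumerate every deterministic stationary policy (exponential, toy use only)."""
    states = sorted(instance)
    for choice in product(*(sorted(instance[s]) for s in states)):
        yield dict(zip(states, choice))


def regret(instance: Instance, policy: Policy, start: str) -> float:
    """Cost of the policy minus the optimal cost in this instance."""
    best = min(policy_cost(instance, pi, start) for pi in all_policies(instance))
    return policy_cost(instance, policy, start) - best


def minimax_regret_policy(instances: List[Instance], start: str) -> Tuple[Policy, float]:
    """Pick the policy whose maximum regret over all instances is smallest."""
    def max_regret(pi: Policy) -> float:
        return max(regret(m, pi, start) for m in instances)

    best_policy = min(all_policies(instances[0]), key=max_regret)
    return best_policy, max_regret(best_policy)


def make_instance(p_success: float) -> Instance:
    # Action "a" is cheap but may fail and stay in s0; action "b" is costly but certain.
    return {
        "s0": {
            "a": (1.0, [(GOAL, p_success), ("s0", 1.0 - p_success)]),
            "b": (4.0, [(GOAL, 1.0)]),
        }
    }


instances = [make_instance(0.5), make_instance(0.2)]
policy, max_regret_value = minimax_regret_policy(instances, "s0")
print(policy, max_regret_value)
```

In this example action `a` has expected cost 2 or 5 depending on the instance, while `b` always costs 4. A pure worst-case criterion would choose `b` (worst case 4 versus 5), whereas minimax regret chooses `a`, whose regret is at most about 1 compared with about 2 for `b`. This is the sense in which minimax regret avoids overly conservative policies.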