In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the \emph{percentile criterion}. The percentile criterion is approximately solved by constructing an \emph{ambiguity set} that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses \emph{Bayesian credible regions} as ambiguity sets, but these regions are often unnecessarily large and lead to overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk-based dynamic programming algorithm that optimizes the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.

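In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are computed by optimizing the percentile criterion, usually by constructing an ambiguity set that contains the true model and optimizing the policy against the worst model in that set. However, existing work uses Bayesian credible regions as ambiguity sets, which are often too large and lead to overly conservative policies. To overcome these limitations, we propose a Value-at-Risk-based dynamic programming algorithm that optimizes the percentile criterion without explicitly constructing any ambiguity set. Our theoretical and empirical results show that the algorithm implicitly constructs smaller ambiguity sets and learns less conservative robust policies.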
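To make the idea of a Value-at-Risk based dynamic programming update concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a small tabular MDP, a known reward table `R`, and a set of transition models `P_samples` drawn from some posterior; the function name `var_value_iteration`, the array layout, and the empirical quantile convention are all illustrative assumptions. Each backup replaces the worst case over an explicit ambiguity set with a lower quantile of the one-step return across the sampled models.

```python
import numpy as np

def var_value_iteration(P_samples, R, gamma=0.9, delta=0.1, iters=200):
    """Sketch of a quantile (VaR-style) value iteration over sampled models.

    P_samples: array of shape (N, S, A, S), N sampled transition kernels
               (hypothetical input, e.g. posterior samples).
    R:         array of shape (S, A), rewards assumed known.
    delta:     quantile level in [0, 1); delta = 0 takes the worst sampled model.
    Returns the value estimate v (shape (S,)) and a greedy policy (shape (S,)).
    """
    N, S, A, _ = P_samples.shape
    # Index of the empirical lower delta-quantile among N sorted values
    # (one possible convention, assumed here for illustration).
    k = min(N - 1, int(delta * N))
    v = np.zeros(S)
    q_var = np.zeros((S, A))
    for _ in range(iters):
        # q[n, s, a]: one-step return of (s, a) under sampled model n.
        q = R[None, :, :] + gamma * np.einsum("nsat,t->nsa", P_samples, v)
        # Quantile across sampled models, taken separately for each (s, a).
        q_var = np.sort(q, axis=0)[k]
        # Greedy improvement over actions.
        v = q_var.max(axis=1)
    return v, q_var.argmax(axis=1)
```

With a single sampled model the update reduces to ordinary value iteration, and lowering `delta` makes the resulting values more pessimistic. Taking the quantile independently for each state-action pair is a simplification chosen for this sketch.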

Percentile Criterion Optimization in Offline Reinforcement Learning