Offline reinforcement learning (RL) presents distinct challenges as it relies
solely on observational data. A central concern in this context is ensuring the
safety of the learned policy by quantifying uncertainties associated with
various actions and environmental stochasticity. Traditional approaches
primarily emphasize mitigating epistemic uncertainty by learning risk-averse
policies, often overlooking environmental stochasticity. In this study, we
propose an uncertainty-aware distributional offline RL method that addresses
epistemic uncertainty and environmental stochasticity simultaneously. The
method is a model-free offline RL algorithm that learns risk-averse policies
and characterizes the entire distribution of discounted cumulative rewards,
rather than merely maximizing the expected discounted return. We evaluate the
method through comprehensive experiments on both risk-sensitive and
risk-neutral benchmarks, demonstrating its superior performance.
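
The contrast between an expected return and a full return distribution can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy trajectories, the discount factor, and the helper names `discounted_return` and `empirical_quantiles` are all assumptions made for illustration.

```python
import math

def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single trajectory."""
    total = 0.0
    # Accumulate backwards so each step applies one more discount factor.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def empirical_quantiles(samples, levels):
    """Empirical quantile of the samples at each level in (0, 1]."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[min(n - 1, max(0, math.ceil(q * n) - 1))] for q in levels]

# Hypothetical logged reward sequences (illustrative data only).
trajectories = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, -10.0],  # a rare, costly outcome
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
gamma = 0.5  # assumed discount factor

returns = [discounted_return(t, gamma) for t in trajectories]
mean_return = sum(returns) / len(returns)
quantiles = empirical_quantiles(returns, [0.25, 0.5, 1.0])

print("returns:  ", returns)
print("mean:     ", mean_return)
print("quantiles:", quantiles)
```

The mean alone hides the negative tail that the lowest quantile exposes, which is the information a risk-averse objective needs.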

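This work proposes an uncertainty-aware offline reinforcement learning method that addresses epistemic uncertainty and environmental stochasticity simultaneously, learning risk-averse policies and characterizing the entire distribution of discounted cumulative rewards. Comprehensive experiments on risk-sensitive and risk-neutral benchmarks demonstrate its superior performance.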

Uncertainty-Based Distributional Offline Reinforcement Learning

Uncertainty-aware Distributional Offline Reinforcement Learning

Markov decision processes (MDPs) are the de facto framework for sequential
decision making in the presence of stochastic uncertainty. A classical
optimization criterion for MDPs is to maximize the expected discounted-sum
payoff, which ignores low probability catastrophic events with highly negative
impact on the system. On the other hand, risk-averse policies require the
probability of undesirable events to be below a given threshold, but they do
not account for optimization of the expected payoff. We consider MDPs with
discounted-sum payoff with failure states which represent catastrophic
outcomes. The objective of risk-constrained planning is to maximize the
expected discounted-sum payoff among risk-averse policies that ensure the
probability to encounter a failure state is below a desired threshold. Our main
contribution is an efficient risk-constrained planning algorithm that combines
UCT-like search with a predictor learned through interaction with the MDP (in
the style of AlphaZero) and with a risk-constrained action selection via linear
programming. We demonstrate the effectiveness of our approach with experiments
on classical MDPs from the literature, including benchmarks with an order of
10^6 states.
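
The risk-constrained objective described above can be written compactly. The notation is an assumption introduced here rather than taken from the paper: $\gamma \in (0,1)$ is the discount factor, $r_t$ the payoff at step $t$, $F$ the set of failure states, and $\Delta$ the risk threshold.

```latex
\max_{\pi} \;\; \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
\quad \text{subject to} \quad
\mathbb{P}^{\pi}\!\left(\exists\, t \ge 0 : s_t \in F\right) \le \Delta
```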

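This study proposes a risk-constrained planning algorithm for MDPs that combines UCT-like search with risk-constrained action selection via linear programming, in order to maximize the expected discounted-sum payoff while keeping the probability of encountering a failure state below a desired threshold.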

Reinforcement Learning Policies with Constrained Risk in Markov Decision Processes

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Recent advances in deep reinforcement learning have demonstrated the
capability of learning complex control policies in many types of
environments. When learning policies for safety-critical applications, it is
essential to be sensitive to risks and avoid catastrophic events. Towards this
goal, we propose an actor-critic framework that models the uncertainty of the
future and simultaneously learns a policy based on that uncertainty model.
Specifically, given a distribution of the future return for any state and
action, we optimize policies for varying levels of conditional Value-at-Risk.
The learned policy can map the same state to different actions depending on the
propensity for risk. We demonstrate the effectiveness of our approach in the
domain of driving simulations, where we learn maneuvers in two scenarios. Our
learned controller can dynamically select actions along a continuous axis,
where safe and conservative behaviors are found at one end while riskier
behaviors are found at the other. Finally, when testing with very different
simulation parameters, our risk-averse policies generalize significantly better
than other reinforcement learning approaches.
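
How the same state can map to different actions at different risk levels can be shown with a small sample-based sketch of conditional Value-at-Risk (CVaR). It is not the paper's actor-critic method. It assumes that return samples per action are already available, and the action names, sample values, and the helpers `cvar` and `select_action` are hypothetical.

```python
import math

def cvar(samples, alpha):
    """Mean of the worst ceil(alpha * n) returns; alpha must lie in (0, 1].

    alpha = 1.0 recovers the plain mean (risk-neutral); a small alpha
    looks only at the lower tail (risk-averse).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(samples)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

def select_action(return_samples, alpha):
    """Pick the action whose sampled return distribution has the highest CVaR."""
    return max(return_samples, key=lambda a: cvar(return_samples[a], alpha))

# Hypothetical return samples for two actions in one state (illustrative only).
return_samples = {
    "overtake": [-20.0, 10.0, 12.0, 14.0],  # higher mean, heavy lower tail
    "follow": [2.0, 3.0, 3.0, 4.0],         # lower mean, narrow spread
}

for alpha in (1.0, 0.5, 0.25):
    print(alpha, select_action(return_samples, alpha))
```

With these toy samples the risk-neutral setting (`alpha = 1.0`) prefers the high-mean action, while the smaller values of `alpha` switch to the conservative one.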

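This work proposes a deep reinforcement learning method based on an actor-critic framework and conditional Value-at-Risk. Applied to driving simulation, it aims to maximize task-completion efficiency while maintaining safety, and it generalizes better than other deep reinforcement learning methods.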

Worst-Case Policy Gradients

Worst Cases Policy Gradients

Partially-observable Markov decision processes (POMDPs) with discounted-sum
payoff are a standard framework to model a wide range of problems related to
decision making under uncertainty. Traditionally, the goal has been to obtain
policies that optimize the expectation of the discounted-sum payoff. A key
drawback of the expectation measure is that even low-probability events with
extreme payoff can significantly affect the expectation, and thus the obtained
policies are not necessarily risk-averse. An alternative approach is to optimize
the probability that the payoff is above a certain threshold, which allows
obtaining risk-averse policies, but ignores optimization of the expectation. We
consider the expectation optimization with probabilistic guarantee (EOPG)
problem, where the goal is to optimize the expectation ensuring that the payoff
is above a given threshold with at least a specified probability. We present
several results on the EOPG problem, including the first algorithm to solve it.
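
One way to state the EOPG problem formally is sketched here. The symbols are assumptions made for illustration: $\mathrm{Disc}_{\gamma}$ denotes the discounted-sum payoff, $\tau$ the payoff threshold, and $\alpha$ the required probability. Whether the threshold inequality is strict is not specified in the text above.

```latex
\max_{\pi} \;\; \mathbb{E}^{\pi}\!\left[\mathrm{Disc}_{\gamma}\right]
\quad \text{subject to} \quad
\mathbb{P}^{\pi}\!\left(\mathrm{Disc}_{\gamma} \ge \tau\right) \ge \alpha,
\qquad
\mathrm{Disc}_{\gamma} = \sum_{t=0}^{\infty} \gamma^{t} r_t
```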

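This paper studies how to guarantee, when optimizing the expectation in partially observable Markov decision processes, that the payoff also meets a probabilistic guarantee, and proposes an algorithm for solving this problem.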