Offline reinforcement learning (RL) presents distinct challenges as it relies
solely on observational data. A central concern in this context is ensuring the
safety of the learned policy by quantifying uncertainties associated with
various actions and environmental stochasticity. Traditional approaches
primarily emphasize mitigating epistemic uncertainty by learning risk-averse
policies, often overlooking environmental stochasticity. In this study, we
propose an uncertainty-aware distributional offline RL method to simultaneously
address both epistemic uncertainty and environmental stochasticity. We propose
a model-free offline RL algorithm capable of learning risk-averse policies and
characterizing the entire distribution of discounted cumulative rewards, as
opposed to merely maximizing the expected value of accumulated discounted
returns. Our method is rigorously evaluated through comprehensive experiments
in both risk-sensitive and risk-neutral benchmarks, demonstrating its superior
performance.

提出了一种不确定性感知的离线强化学习方法，同时解决了认知不确定性和环境随机性，能够学习风险规避策略并表征折扣累积奖励的整个分布。通过在风险敏感和风险中立基准测试中进行全面实验评估，证明了其卓越的性能。

基于不确定性的分布离线强化学习

Uncertainty-aware Distributional Offline Reinforcement Learning

Many applications of imitation learning require the agent to generate the
full distribution of behaviour observed in the training data. For example, to
evaluate the safety of autonomous vehicles in simulation, accurate and diverse
behaviour models of other road users are paramount. Existing methods that
improve this distributional realism typically rely on hierarchical policies.
These condition the policy on types such as goals or personas that give rise to
multi-modal behaviour. However, such methods are often inappropriate for
stochastic environments where the agent must also react to external factors:
because agent types are inferred from the observed future trajectory during
training, these environments require that the contributions of internal and
external factors to the agent behaviour are disentangled and only internal
factors, i.e., those under the agent's control, are encoded in the type.
Encoding future information about external factors leads to inappropriate agent
reactions during testing, when the future is unknown and types must be drawn
independently from the actual future. We formalize this challenge as
distribution shift in the conditional distribution of agent types under
environmental stochasticity. We propose Robust Type Conditioning (RTC), which
eliminates this shift with adversarial training under randomly sampled types.
Experiments on two domains, including the large-scale Waymo Open Motion
Dataset, show improved distributional realism while maintaining or improving
task performance compared to state-of-the-art baselines.

針對環境的隨機性，本研究提出了 Robust Type Conditioning (RTC) 方法，通過對隨機抽樣的代理類型進行對抗性訓練，實現了分佈的逼真性，並在任務性能上保持或提升了與最先進方法相比的表現。