The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents uncertain about the true reward function. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm that is robust to a distribution of reward hypotheses and scales to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and that it outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
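To make the idea of a soft-robust objective concrete, here is a minimal sketch of how expected performance and risk might be combined over a discrete set of reward hypotheses. The abstract does not name the risk measure or the weighting scheme, so the use of conditional value at risk (CVaR) and the names `lam`, `alpha`, `cvar`, and `soft_robust_objective` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def cvar(returns, posterior, alpha):
    """CVaR of a discrete return distribution (assumed risk measure).

    Uses the identity CVaR_alpha = max_sigma sigma - E[(sigma - X)^+] / (1 - alpha).
    For a discrete distribution the maximum is attained at one of the atoms,
    so it is enough to try every observed return as a candidate sigma.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    returns = np.asarray(returns, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    # shortfall[j, i] = max(0, sigma_j - returns_i), with sigma_j = returns_j
    shortfall = np.maximum(0.0, returns[:, None] - returns[None, :])
    candidates = returns - (shortfall @ posterior) / (1.0 - alpha)
    return float(candidates.max())

def soft_robust_objective(returns, posterior, lam, alpha):
    """Blend expected return and CVaR (hypothetical weighting).

    returns[i]   : policy's expected return under reward hypothesis i
    posterior[i] : probability assigned to reward hypothesis i
    lam = 1 is risk-neutral; lam = 0 optimizes the tail only.
    """
    returns = np.asarray(returns, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    expected = float(posterior @ returns)
    return lam * expected + (1.0 - lam) * cvar(returns, posterior, alpha)

# Illustrative values only: four equally likely reward hypotheses.
example_returns = [1.0, 2.0, 3.0, 4.0]
example_posterior = [0.25, 0.25, 0.25, 0.25]
risk_neutral = soft_robust_objective(example_returns, example_posterior, lam=1.0, alpha=0.5)
risk_averse = soft_robust_objective(example_returns, example_posterior, lam=0.0, alpha=0.5)
```

Under these assumptions, sweeping `lam` from 1 to 0 moves the score from the posterior mean toward the average of the worst-case tail, which is one way a single objective can yield a family of behaviors ranging from risk-neutral to risk-averse.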

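This paper proposes PG-BROIL, a novel policy gradient-style robust optimization method that optimizes a soft-robust objective balancing expected performance and risk. When many candidate reward functions remain plausible, it produces policies whose behavior ranges from risk-neutral to risk-averse, and it outperforms state-of-the-art imitation learning algorithms.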

Policy Gradient Bayesian Robust Optimization for Imitation Learning