BriefGPT.xyz
Jul, 2019
Safe Policy Improvement with Soft Baseline Bootstrapping
Kimia Nadjahi, Romain Laroche, Rémi Tachet des Combes
TL;DR
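This paper builds on Safe Policy Improvement with Baseline Bootstrapping (SPIBB) and allows the policy search to range over a wider set of policies. Policy changes are constrained by controlling local model uncertainty, which gives a more comprehensive assessment of the risk of capturing bad behaviour. Experiments show significant improvements over existing SPI algorithms, both on finite MDPs and on infinite MDPs with neural network function approximation.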
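The idea of constraining policy changes by local model uncertainty can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the summary: the count-based error `scale / sqrt(N(x, a))`, the uncertainty-weighted L1 budget `epsilon`, and the helper names `local_errors` and `within_budget` are all hypothetical. The paper's actual error function and constraint may differ.

```python
import math
from collections import Counter


def local_errors(batch, actions, scale=1.0):
    """Hypothetical count-based uncertainty per (state, action) pair.

    Assumes error = scale / sqrt(N(x, a)); pairs never seen in the
    batch get infinite error.
    """
    counts = Counter((x, a) for x, a, _r, _x_next in batch)
    states = {x for x, _a, _r, _x_next in batch}
    return {
        (x, a): scale / math.sqrt(counts[(x, a)]) if counts[(x, a)] > 0 else math.inf
        for x in states
        for a in actions
    }


def within_budget(pi, pi_b, errors, state, actions, epsilon):
    """Check an assumed uncertainty-weighted L1 constraint at one state.

    Returns True when sum_a error(x, a) * |pi(a|x) - pi_b(a|x)| <= epsilon.
    """
    total = 0.0
    for a in actions:
        change = abs(pi[state][a] - pi_b[state][a])
        if change > 0.0:  # skip unchanged actions so inf * 0 never occurs
            total += errors[(state, a)] * change
    return total <= epsilon


# Toy batch of (state, action, reward, next_state) transitions:
# "left" is well explored (100 samples), "right" is not (4 samples).
batch = [("s0", "left", 0.0, "s0")] * 100 + [("s0", "right", 1.0, "s0")] * 4
actions = ["left", "right"]
errors = local_errors(batch, actions)

baseline = {"s0": {"left": 0.5, "right": 0.5}}
candidate = {"s0": {"left": 0.3, "right": 0.7}}

# Moving 0.2 probability mass toward the rarely seen action costs
# 0.1 * 0.2 + 0.5 * 0.2 of the budget under these assumed errors.
print(within_budget(candidate, baseline, errors, "s0", actions, epsilon=0.2))
print(within_budget(candidate, baseline, errors, "s0", actions, epsilon=0.1))
```

Under these assumptions, a candidate policy may move further from the baseline on well-sampled state-action pairs than on poorly sampled ones, which is the intuition behind searching a wider policy set while limiting risk.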
Abstract
Batch reinforcement learning (Batch RL) consists in training a policy using trajectories collected with another policy, called the behavioural policy.
Safe policy improvement (SPI) provides guarantees with high p…