BriefGPT.xyz
Jul, 2021
Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage
Masatoshi Uehara, Wen Sun
TL;DR
Studies the partial-coverage setting common in offline reinforcement learning data and proposes the Constrained Pessimistic Policy Optimization (CPPO) algorithm, which encodes pessimism through a constraint over the model class; the algorithm enjoys PAC guarantees even when the data does not fully cover the state-action space.
Abstract
We study model-based offline reinforcement learning with general function approximation. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class a…
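To make the constrained-pessimism idea concrete, here is a minimal sketch on a toy tabular MDP: keep every candidate model whose log-likelihood on the offline data is within a slack of the maximum-likelihood model (a likelihood-based version space), then pick the policy that maximizes the worst-case value over that set. The toy MDP, the three-element model class, and the slack radius are all illustrative assumptions, not the paper's actual construction (which handles general function classes and ties the radius to the statistical theory).

```python
import itertools
import numpy as np

# Hypothetical toy instance: 2 states, 2 actions, known reward, unknown transitions.
n_s, n_a, gamma = 2, 2, 0.9
R = np.array([[0.0, 1.0], [1.0, 0.0]])  # reward R[s, a]

def value(policy, P):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_s)])
    r_pi = np.array([R[s, policy[s]] for s in range(n_s)])
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

def log_lik(P, data):
    """Log-likelihood of the observed transitions under model P."""
    return sum(np.log(P[s, a, s2]) for s, a, s2 in data)

def make_model(p):
    """A parametric family of transition tensors P[s, a] -> dist over next states."""
    P = np.zeros((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            P[s, a] = [p, 1 - p] if (s + a) % 2 == 0 else [1 - p, p]
    return P

# A small candidate model class; the last element plays the role of the true model.
models = [make_model(p) for p in (0.2, 0.5, 0.8)]
true_P = models[2]

# Offline data from a behavior policy with only partial coverage:
# it never plays action 1 in state 1.
rng = np.random.default_rng(0)
data = []
for _ in range(200):
    s = int(rng.integers(n_s))
    a = 0 if s == 1 else int(rng.integers(n_a))
    s2 = int(rng.choice(n_s, p=true_P[s, a]))
    data.append((s, a, s2))

# Version space: models whose likelihood is within a slack of the best model.
lls = np.array([log_lik(P, data) for P in models])
slack = 5.0  # illustrative radius; the theory relates it to model-class complexity
version_space = [P for P, ll in zip(models, lls) if ll >= lls.max() - slack]

# Constrained pessimistic policy optimization:
# maximize the worst-case (over the version space) average value.
policies = list(itertools.product(range(n_a), repeat=n_s))

def pess_value(pi):
    return min(value(pi, P).mean() for P in version_space)

cppo_pi = max(policies, key=pess_value)
```

The max-min structure is the key point: pessimism comes not from explicit uncertainty bonuses but from evaluating each candidate policy under its least favorable data-consistent model, which is what allows guarantees under partial coverage.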