BriefGPT.xyz
Jun, 2024
Operator World Models for Reinforcement Learning
Pietro Novelli, Marco Pratticò, Massimiliano Pontil, Carlo Ciliberto
TL;DR
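We propose POWR, a new RL algorithm that learns a world model of the environment with conditional mean embeddings, uses the operator formulation of RL to express the required quantities as matrix operations, and combines these with a Policy Mirror Descent (PMD) estimator. We prove convergence rates to the global optimum, and experiments show that the method is effective in both finite and infinite state settings.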
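The TL;DR combines a learned world model with an operator form of policy evaluation. The sketch below illustrates only the finite-state special case, assuming one-hot features for state-action pairs and next states. Under that assumption, ridge regression of next-state features on state-action features (the conditional mean embedding estimator) reduces to shrunk transition counts, count(s, a, s') / (count(s, a) + lam). The action values are then the fixed point of Q = r + gamma * P (Pi Q), where Pi averages Q over the policy. The helper names `fit_world_model` and `evaluate_policy` are hypothetical and are not the paper's code. The infinite-state setting, which needs general kernels, is not covered here.

```python
from collections import defaultdict


def fit_world_model(transitions, n_states, n_actions, lam=0.0):
    """Hypothetical tabular conditional mean embedding.

    With one-hot features, ridge regression (regularization lam) of next-state
    indicators on (s, a) indicators has the closed form
    P[(s, a)][s'] = count(s, a, s') / (count(s, a) + lam).
    """
    counts = defaultdict(float)  # (s, a, s') -> number of observed transitions
    visits = defaultdict(float)  # (s, a) -> number of visits
    for s, a, s_next in transitions:
        counts[(s, a, s_next)] += 1.0
        visits[(s, a)] += 1.0
    P = {}
    for s in range(n_states):
        for a in range(n_actions):
            denom = visits[(s, a)] + lam
            # Unvisited pairs with lam == 0 get an all-zero row.
            P[(s, a)] = [counts[(s, a, sp)] / denom if denom > 0 else 0.0
                         for sp in range(n_states)]
    return P


def evaluate_policy(P, r, pi, gamma, n_iters=1000):
    """Solve Q = r + gamma * P (Pi Q) by fixed-point iteration.

    (Pi Q)(s') = sum_a pi[s'][a] * Q[(s', a)]. The map is a gamma-contraction
    because every row of P sums to at most 1 and every row of pi sums to 1.
    """
    Q = {sa: 0.0 for sa in P}
    for _ in range(n_iters):
        V = [sum(p * Q[(s, a)] for a, p in enumerate(row)) for s, row in enumerate(pi)]
        Q = {sa: r[sa] + gamma * sum(p * v for p, v in zip(P[sa], V)) for sa in P}
    return Q


# Toy data: action 1 in state 0 moves to the absorbing, rewarding state 1.
data = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
P_hat = fit_world_model(data, n_states=2, n_actions=2)
rewards = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 1.0}
uniform = [[0.5, 0.5], [0.5, 0.5]]
Q_hat = evaluate_policy(P_hat, rewards, uniform, gamma=0.9)
```

For this toy data, state 1 has value 1 / (1 - 0.9) = 10, so Q(0, 1) = 0.9 * 10 = 9. Setting lam > 0 shrinks each row of the estimated transition matrix toward zero for rarely visited pairs.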
Abstract
Policy mirror descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. However, it is not directly applicable to reinforcement learning (RL) due to the inaccessibility of ex…
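The update rule shows why PMD depends on quantities that are hard to access in RL. With the KL divergence as the mirror map, each PMD step has the closed form pi_{t+1}(a|s) ∝ pi_t(a|s) exp(eta Q^{pi_t}(s, a)), and every step needs the action values of the current policy. The sketch below runs this idealized PMD on a small tabular MDP whose transitions and rewards are assumed known, so Q^{pi_t} can be computed exactly. It is not the paper's POWR estimator. The function names and the step size `eta` are illustrative assumptions.

```python
import math


def evaluate_policy(P, r, pi, gamma, n_iters=1000):
    """Exact Q^pi for a known tabular model: iterate Q = r + gamma * P (Pi Q)."""
    Q = {sa: 0.0 for sa in P}
    for _ in range(n_iters):
        V = [sum(p * Q[(s, a)] for a, p in enumerate(row)) for s, row in enumerate(pi)]
        Q = {sa: r[sa] + gamma * sum(p * v for p, v in zip(P[sa], V)) for sa in P}
    return Q


def pmd_step(pi, Q, eta):
    """One KL-regularized PMD step: pi'(a|s) ∝ pi(a|s) * exp(eta * Q(s, a))."""
    new_pi = []
    for s, row in enumerate(pi):
        q_max = max(Q[(s, a)] for a in range(len(row)))  # shift for numerical stability
        w = [p * math.exp(eta * (Q[(s, a)] - q_max)) for a, p in enumerate(row)]
        z = sum(w)
        new_pi.append([x / z for x in w])
    return new_pi


def policy_mirror_descent(P, r, n_states, n_actions, gamma=0.9, eta=1.0, n_steps=50):
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]  # uniform start
    for _ in range(n_steps):
        Q = evaluate_policy(P, r, pi, gamma)  # exact action values of the current policy
        pi = pmd_step(pi, Q, eta)
    return pi


# Known two-state MDP: action 1 in state 0 reaches the absorbing, rewarding state 1.
P_true = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [0.0, 1.0], (1, 1): [0.0, 1.0]}
rewards = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 1.0}
pi_final = policy_mirror_descent(P_true, rewards, n_states=2, n_actions=2)
```

In state 0, the policy concentrates on action 1 because its action value is larger at every iteration. In state 1, both actions have equal values, so that state's distribution stays uniform. The exact evaluation call inside the loop is the step that an RL agent cannot perform without a model.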