BriefGPT.xyz
Nov, 2023
Clipped-Objective Policy Gradients for Pessimistic Policy Optimization
Jared Markowitz, Edward W. Staley
TL;DR
Through a simple change of objective, we find that in continuous action spaces, replacing the importance-sampling objective of Proximal Policy Optimization (PPO) with a clipped-objective counterpart of the basic policy gradient consistently improves its performance. This pessimistic optimization encourages enhanced exploration, yielding improved learning in single-task, constrained, and multi-task settings without adding significant computational cost or complexity.
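For orientation, here is a minimal PyTorch-style sketch contrasting the standard PPO clipped surrogate with a clipped plain policy-gradient objective in the spirit of the TL;DR. The PPO loss follows the well-known formulation; the second function is an illustrative guess at how clipping might be applied to the basic policy-gradient term (the names `clipped_pg_loss`, the clipping bounds, and `eps` are assumptions, not the authors' exact formulation — consult the paper for the precise objective).

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO clipped surrogate (Schulman et al., 2017).

    logp_new: log pi_theta(a|s) under the current policy
    logp_old: log pi_theta_old(a|s) under the data-collecting policy
    adv:      advantage estimates
    """
    ratio = torch.exp(logp_new - logp_old)              # importance-sampling ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()         # pessimistic lower bound

def clipped_pg_loss(logp_new, logp_old, adv, eps=0.2):
    """Hypothetical clipped-objective policy gradient: the importance ratio is
    replaced by the log-probability term of the basic policy gradient, keeping
    the same pessimistic min/clip structure. Illustrative only; not taken
    verbatim from the paper."""
    unclipped = logp_new * adv
    clipped = torch.clamp(logp_new, logp_old - eps, logp_old + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```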
Abstract
To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based …