Quantile-Based Policy Optimization for Reinforcement Learning

Jan, 2022

Jinyang Jiang, Jiaqiao Hu, Yijie Peng

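TL;DR: This paper proposes an RL algorithm called Quantile-Based Policy Optimization (QPO), which performs better than existing algorithms when the objective is a quantile of the cumulative rewards. The algorithm parameterizes the policy with a neural network and uses two coupled iterations to estimate the quantile and the policy parameters.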
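To make the quantile-estimation half of the two coupled iterations concrete, here is a minimal sketch of a standard stochastic-approximation recursion that tracks the alpha-quantile of a stream of returns. It is an illustration under stated assumptions, not the paper's implementation: the function name `track_quantile`, the step-size schedule, and the Gaussian stand-in for episode returns are all hypothetical, and the policy-parameter iteration that would run alongside it is not shown.

```python
import numpy as np

def track_quantile(samples, alpha, q0=0.0, step_scale=1.0, step_power=0.7):
    """Recursively estimate the alpha-quantile of a stream of scalar returns.

    The step-size schedule step_scale / k**step_power is an assumed choice
    for illustration, not one taken from the paper.
    """
    q = q0
    for k, r in enumerate(samples, start=1):
        beta_k = step_scale / k ** step_power  # diminishing step size
        # Nudge q up by beta_k * alpha when the sample exceeds q, and down by
        # beta_k * (1 - alpha) otherwise. The recursion is stationary in
        # expectation where P(R <= q) = alpha, i.e. at the alpha-quantile.
        q += beta_k * (alpha - float(r <= q))
    return q

# Stand-in for cumulative rewards collected under a fixed policy (assumption).
rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=2.0, size=50_000)

q_hat = track_quantile(returns, alpha=0.25)
q_ref = np.quantile(returns, 0.25)  # batch estimate for comparison
print(f"recursive estimate: {q_hat:.3f}, batch quantile: {q_ref:.3f}")
```

In a coupled scheme of the kind the summary describes, the returns would come from episodes generated by the current policy, so the quantile estimate and the policy parameters would be updated together rather than on a fixed sample stream as here.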

Abstract

Classical reinforcement learning (RL) aims to optimize the expected cumulative rewards. In this work, we consider the RL setting where the goal is to optimize the quantile of the cumulative rewards. We parameterize the policy controlling actions by →