Simultaneous Perturbation Algorithms for Batch Off-Policy Search

Mar, 2014

Raphael Fonteneau, Prashanth L. A

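TL;DR: This paper proposes new policy search algorithms for off-policy, batch-mode reinforcement learning with continuous state and action spaces. The algorithms comprise a first-order policy gradient method and a second-order policy Newton method, and all of them use simultaneous perturbation estimates of the gradient and Hessian of the cost-to-go, for which only biased estimates are available. The paper demonstrates their practicality on a simple one-dimensional continuous state space problem.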
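To make the phrase "simultaneous perturbation" concrete, here is a minimal Python sketch of a generic first-order simultaneous perturbation (SPSA-style) gradient estimate driving plain gradient descent. It is an illustration under stated assumptions, not the paper's algorithm: `toy_cost` is a hypothetical stand-in for the batch off-policy policy evaluator, and the names `spsa_gradient` and `spsa_descent`, the step size, and the perturbation size are all chosen here for illustration.

```python
import random


def spsa_gradient(cost, theta, delta, rng):
    """One-sample simultaneous perturbation estimate of the gradient of cost at theta."""
    # Perturb every coordinate at once with independent +/-1 signs.
    signs = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + delta * s for t, s in zip(theta, signs)]
    minus = [t - delta * s for t, s in zip(theta, signs)]
    diff = cost(plus) - cost(minus)
    # Two cost evaluations give an estimate for all coordinates together.
    return [diff / (2.0 * delta * s) for s in signs]


def spsa_descent(cost, theta, steps=500, step_size=0.05, delta=1e-2, seed=0):
    """Gradient descent on cost using only SPSA gradient estimates."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    theta = list(theta)
    for _ in range(steps):
        grad = spsa_gradient(cost, theta, delta, rng)
        theta = [t - step_size * g for t, g in zip(theta, grad)]
    return theta


def toy_cost(theta):
    # Hypothetical stand-in for an estimated cost of a policy with
    # parameters theta; it is minimised at theta = (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


theta_hat = spsa_descent(toy_cost, [0.0, 0.0])
print(theta_hat)
```

The appeal of this kind of estimator is that it needs only two cost evaluations per step regardless of the number of policy parameters, which matters when each evaluation is an expensive pass over a batch of trajectories.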

Abstract

We propose two novel policy search algorithms in the context of off-policy, batch mode reinforcement learning (RL) with continuous state and action spaces. Given a batch collection of trajectories, we perform