Trust Region Policy Optimization (TRPO)

Feb, 2015

Trust Region Policy Optimization

John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel

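TL;DR: This paper proposes a practical algorithm called TRPO that optimizes policies with guaranteed monotonic improvement. A series of experiments demonstrates its strong performance across a range of tasks, including learning simulated robotic swimming, hopping, and walking gaits and playing Atari games directly from screen images.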

Abstract

We propose a family of trust region policy optimization (TRPO) algorithms for learning control policies. We first develop a policy update scheme with guaranteed monotonic improvement, and then we describe a finit