A Bridge Between Value-Function and Policy-Function Reinforcement Learning

Feb, 2017

Bridging the Gap Between Value and Policy Based Reinforcement Learning

Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

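TL;DR: This work proposes a new reinforcement learning algorithm, Path Consistency Learning (PCL). It builds on the connection between policies and values and on minimizing a soft consistency error, learns a policy and a state value function at the same time, and outperforms traditional algorithms on a range of benchmarks.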
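To make the idea of a "soft consistency error" more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the temperature parameter `tau`, and the one-step form of the residual are assumptions chosen for illustration, based on the common entropy-regularized formulation in which the hard max over actions is replaced by a temperature-scaled log-sum-exp.

```python
import math

def soft_max_value(q_values, tau):
    """Temperature-scaled log-sum-exp: tau * log(sum(exp(q / tau))).

    As tau approaches 0 this approaches max(q_values), the hard-max backup.
    """
    m = max(q_values)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def soft_policy(q_values, tau):
    """Boltzmann policy with pi(a) proportional to exp(q(a) / tau)."""
    v = soft_max_value(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]

def consistency_error(v_s, v_next, reward, log_pi, gamma, tau):
    """One-step soft consistency residual (assumed form, for illustration):
    -V(s) + gamma * V(s') + r - tau * log pi(a | s).
    """
    return -v_s + gamma * v_next + reward - tau * log_pi

# Hypothetical single-step problem: three actions, each leading to a
# terminal state (V(s') = 0) with the rewards below.
rewards = [1.0, 2.0, 3.0]
tau, gamma = 1.0, 0.99

v = soft_max_value(rewards, tau)
pi = soft_policy(rewards, tau)
errors = [
    consistency_error(v, 0.0, r, math.log(p), gamma, tau)
    for r, p in zip(rewards, pi)
]
```

In this toy setting the residual vanishes for every action when the value and policy come from the same softmax, which is the kind of consistency a learner could push toward by minimizing the squared residual over observed transitions.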

Abstract

We formulate a new notion of softmax temporal consistency that generalizes the standard hard-max Bellman consistency usually considered in value based reinforcement learning (RL). In particular, we show how softmax consistent action values correspond to optimal policies that maximize e