The policy gradient theorem describes the gradient of the expected discounted
return with respect to an agent's policy parameters. However, most policy
gradient methods drop the discount factor from the state distribution and
therefore do not optimize the discounted objective. What do they optimize
instead? This has been an open question for several years, and this lack of
theoretical clarity has led to an abundance of misstatements in the
literature. We answer this question by proving that the update direction
approximated by most methods is not the gradient of any function. Further, we
argue that algorithms that follow this direction are not guaranteed to converge
to a "reasonable" fixed point by constructing a counterexample wherein the
fixed point is globally pessimal with respect to both the discounted and
undiscounted objectives. We motivate this work by surveying the literature and
showing that there remains a widespread misunderstanding regarding discounted
policy gradient methods, with errors present even in highly cited papers
published at top conferences.
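
To make the distinction concrete, here is a minimal sketch of the per-step weights that multiply the score function, grad log pi(a_t | s_t), in a Monte Carlo (REINFORCE-style) estimate for a single trajectory. It assumes sampled returns stand in for action values; the function names `discounted_returns` and `step_weights` and the flag `keep_state_discount` are hypothetical and do not come from the paper. With the flag set, each step keeps the gamma^t factor called for by the policy gradient theorem; without it, the factor is dropped, which is the variant the abstract attributes to most methods.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma**(k - t) * rewards[k] for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def step_weights(rewards, gamma, keep_state_discount):
    """Weights applied to grad log pi(a_t | s_t) at each step of one trajectory.

    keep_state_discount=True  -> gamma**t * G_t (discounted objective)
    keep_state_discount=False -> G_t            (gamma**t factor dropped)
    """
    returns = discounted_returns(rewards, gamma)
    if keep_state_discount:
        return [gamma ** t * g for t, g in enumerate(returns)]
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print(step_weights(rewards, 0.5, keep_state_discount=True))
    print(step_weights(rewards, 0.5, keep_state_discount=False))
```

The two weightings agree at the first step and diverge afterwards: dropping the factor gives later states more influence on the update than the discounted objective assigns them.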

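Even papers published at top conferences contain misstatements about the effect of dropping the discount factor from the state distribution. Methods that drop it do not optimize the discounted objective: the update direction that most methods approximate is not the gradient of any function, so these algorithms are not guaranteed to converge to a reasonable fixed point.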

Is the Policy Gradient a Gradient?

In a recent paper, "Why does deep and cheap learning work so well?", Lin and
Tegmark claim to show that the mapping between deep belief networks and the
variational renormalization group derived in [arXiv:1410.3831] is invalid, and
present a "counterexample" that claims to show that this mapping does not hold.
In this comment, we show that these claims are incorrect and stem from a
misunderstanding of the variational RG procedure proposed by Kadanoff. We also
explain why the "counterexample" of Lin and Tegmark is compatible with the
mapping proposed in [arXiv:1410.3831].

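This comment rebuts the recent paper by Lin and Tegmark, "Why does deep and cheap learning work so well?". It argues that their claims are incorrect and rest on a misunderstanding of the variational RG procedure proposed by Kadanoff, and that their counterexample is compatible with the earlier mapping.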