We study the error introduced by entropy regularization of infinite-horizon
discrete discounted Markov decision processes. We show that this error
decreases exponentially in the inverse regularization strength, both in a
weighted KL-divergence and in value, with a problem-specific exponent. We
provide a lower bound matching our upper bound up to a polynomial factor. Our
proof relies on the correspondence of the solutions of entropy-regularized
Markov decision processes with gradient flows of the unregularized reward with
respect to a Riemannian metric common in natural policy gradient methods.
Further, this correspondence allows us to identify the limit of the gradient
flow as the generalized maximum entropy optimal policy, thereby characterizing
the implicit bias of the Kakade gradient flow, which corresponds to a
time-continuous version of the natural policy gradient method. We use this to
show that, for entropy-regularized natural policy gradient methods, the overall
error decays exponentially in the square root of the number of iterations,
improving on existing sublinear guarantees.

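We study the error introduced by entropy regularization in infinite-horizon discrete discounted Markov decision processes and show that it decreases exponentially in the inverse regularization strength, both in a weighted KL divergence and in value, with a problem-specific exponent. A lower bound matches the upper bound up to a polynomial factor. The proof uses the correspondence between solutions of entropy-regularized Markov decision processes and gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence also identifies the limit of the gradient flow as the generalized maximum entropy optimal policy, which characterizes the implicit bias of the Kakade gradient flow, the time-continuous version of the natural policy gradient method. As a consequence, for entropy-regularized natural policy gradient methods the overall error decays exponentially in the square root of the number of iterations, improving on existing sublinear guarantees.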

Essentially Sharp Estimates on the Entropy Regularization Error in Discrete Discounted Markov Decision Processes
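
The quantity studied in the abstract above, the gap in unregularized value between an optimal policy and the optimal policy of the entropy-regularized problem, can be explored numerically. The following is a minimal sketch, not the paper's method: it assumes Shannon-entropy regularization with strength `tau` (so the regularized optimal policy is a softmax of the soft Q-function), and the random MDP, its sizes, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def optimal_value(P, r, gamma, iters=2000):
    """Plain value iteration. P has shape (S, A, S), r has shape (S, A)."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

def soft_optimal_policy(P, r, gamma, tau, iters=2000):
    """Soft value iteration for the entropy-regularized MDP (assumed convention:
    reward plus tau times the Shannon entropy of the policy at each step)."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V
        m = Q.max(axis=1)
        # V(s) = tau * log sum_a exp(Q(s, a) / tau), computed stably
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    Q = r + gamma * P @ V
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
    return pi / pi.sum(axis=1, keepdims=True)

def policy_value(P, r, gamma, pi):
    """Unregularized value of a policy, from the linear Bellman equation."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Hypothetical random MDP used only for illustration.
S, A, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

V_star = optimal_value(P, r, gamma)
errors = {}
for tau in [1.0, 0.5, 0.2, 0.1, 0.05]:
    pi_tau = soft_optimal_policy(P, r, gamma, tau)
    errors[tau] = float(np.max(V_star - policy_value(P, r, gamma, pi_tau)))
    print(f"tau = {tau:5.2f}   max value error = {errors[tau]:.3e}")
```

Plotting the logarithm of the printed error against `1 / tau` is one way to compare the observed decay with the exponential rate stated in the abstract. The sketch itself only relies on the elementary bound that the error lies between 0 and `tau * log(A) / (1 - gamma)`.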

Kakade's natural policy gradient method has been studied extensively in
recent years, with linear convergence shown both with and without
regularization. We study
another natural gradient method which is based on the Fisher information matrix
of the state-action distributions and has received little attention from the
theoretical side. Here, the state-action distributions follow the Fisher-Rao
gradient flow inside the state-action polytope with respect to a linear
potential. Therefore, we study Fisher-Rao gradient flows of linear programs
more generally and show linear convergence with a rate that depends on the
geometry of the linear program. Equivalently, this yields an estimate on the
error induced by entropic regularization of the linear program which improves
existing results. We extend these results and show sublinear convergence for
perturbed Fisher-Rao gradient flows and natural gradient flows up to an
approximation error. In particular, these general results cover the case of
state-action natural policy gradients.

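We study another natural gradient method, based on the Fisher information matrix of the state-action distributions, and show linear convergence together with a geometry-dependent error estimate that improves existing results. We further extend these results to perturbed Fisher-Rao gradient flows and natural gradient flows, showing sublinear convergence up to an approximation error.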
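
The link between Fisher-Rao gradient flows of linear programs and entropic regularization can be illustrated in the simplest setting. The sketch below is a simplification and not the paper's general result: it assumes the feasible set is the probability simplex rather than a general polytope, where the flow `dx_i/dt = x_i (c_i - <c, x>)` has the closed form `x_i(t)` proportional to `x0_i * exp(t * c_i)`. This is also the maximizer of `<c, x> - (1/t) * KL(x || x0)`, so time `t` plays the role of the inverse regularization strength. The cost vector `c` and the helper name `fisher_rao_flow` are hypothetical.

```python
import numpy as np

def fisher_rao_flow(c, x0, t):
    """Closed-form Fisher-Rao gradient flow of the linear objective <c, x>
    on the probability simplex, started at x0 and evaluated at time t."""
    logits = np.log(x0) + t * c
    logits -= logits.max()          # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

c = np.array([1.0, 0.7, 0.2])       # hypothetical cost; unique maximizer is vertex 0
x0 = np.full(3, 1.0 / 3.0)          # uniform initial condition
gap = np.sort(c)[-1] - np.sort(c)[-2]   # gap between best and second-best vertex

for t in [0.0, 10.0, 20.0, 40.0]:
    x = fisher_rao_flow(c, x0, t)
    subopt = c.max() - c @ x
    # For uniform x0 the suboptimality is at most sum_i (c_max - c_i) * exp(-gap * t).
    bound = (c.max() - c).sum() * np.exp(-gap * t)
    print(f"t = {t:5.1f}   suboptimality = {subopt:.3e}   bound = {bound:.3e}")
```

In this toy case the exponential rate is the gap between the optimal and the second-best vertex, which gives a concrete instance of a rate that depends on the geometry of the linear program.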