We study the error introduced by entropy regularization of infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength, both in a weighted KL-divergence and in value, with a problem-specific exponent. We provide a lower bound matching our upper bound up to a polynomial factor. Our proof relies on the correspondence between the solutions of entropy-regularized Markov decision processes and gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. Further, this correspondence allows us to identify the limit of the gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of the Kakade gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. We use this to show that, for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving on existing sublinear guarantees.
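The dependence of the value error on the regularization strength can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical two-state, two-action MDP with made-up transition probabilities `P`, rewards `R`, and discount `GAMMA`, computes the entropy-regularized optimal policy by soft value iteration for several regularization strengths `tau`, and reports the unregularized value gap between the optimal policy and the regularized one. All function names are illustrative.

```python
import math

GAMMA = 0.9  # discount factor (assumed for this toy example)

# Hypothetical toy MDP, made up for illustration: P[s][a][t] is the
# probability of moving from state s to state t under action a, and
# R[s][a] is the immediate reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [
    [1.0, 0.0],
    [0.5, 2.0],
]
N_STATES, N_ACTIONS = 2, 2


def q_from_v(v):
    """One-step lookahead: Q(s, a) = R(s, a) + gamma * E[v(next state)]."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_STATES))
         for a in range(N_ACTIONS)]
        for s in range(N_STATES)
    ]


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    ws = [math.exp(x - m) for x in xs]
    z = sum(ws)
    return [w / z for w in ws]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def optimal_value(iters=2000):
    """Unregularized optimal value via standard value iteration."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [max(row) for row in q_from_v(v)]
    return v


def regularized_policy(tau, iters=2000):
    """Soft value iteration; returns the softmax policy pi(a|s) for strength tau."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [tau * logsumexp([x / tau for x in row]) for row in q_from_v(v)]
    return [softmax([x / tau for x in row]) for row in q_from_v(v)]


def policy_value(pi, iters=2000):
    """Unregularized value of a fixed policy via iterative policy evaluation."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(pi[s][a] * q[s][a] for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return v


def regularization_error(tau):
    """Largest gap over states between the optimal value and the value of pi_tau."""
    v_star = optimal_value()
    v_pi = policy_value(regularized_policy(tau))
    return max(v_star[s] - v_pi[s] for s in range(N_STATES))


TAUS = [1.0, 0.5, 0.25, 0.125, 0.0625]
errors = [regularization_error(tau) for tau in TAUS]
for tau, err in zip(TAUS, errors):
    print(f"tau={tau:<7} value error={err:.3e}")
```

On this toy instance the printed error shrinks as `tau` decreases, which is consistent with decay in the inverse regularization strength; the sketch does not estimate the problem-specific exponent or the polynomial factor. Plain fixed-point iteration is used only for brevity, since the example has two states.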

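This work studies the error introduced by entropy regularization in infinite-horizon discrete discounted Markov decision processes and proves that the error decreases exponentially in the inverse regularization strength, both in a weighted KL-divergence and in value, with a problem-specific exponent. A lower bound is provided that matches the upper bound up to a polynomial factor. The proof relies on the correspondence between the solutions of entropy-regularized Markov decision processes and gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence also identifies the limit of the gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of the Kakade gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. The result is used to show that, for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving on existing sublinear guarantees.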

Sharp Estimates of the Entropy Regularization Error in Discrete Discounted Markov Decision Processes