Policy optimization methods are popular reinforcement learning algorithms in
practice. Recent works have built a theoretical foundation for them by proving
$\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are
tight in the worst case but often overly pessimistic. In this work, we show
that in tabular Markov decision processes (MDPs), by properly designing the
regularizer, the exploration bonus and the learning rates, one can achieve a
more favorable polylog$(T)$ regret when the losses are stochastic, without
sacrificing the worst-case guarantee in the adversarial regime. To our
knowledge, this is also the first time a gap-dependent polylog$(T)$ regret
bound is shown for policy optimization. Specifically, we achieve this by
leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy
update. We further show that, under known transitions, a log-barrier
regularizer yields a first-order regret bound in the adversarial regime.

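In summary, we study policy optimization methods in tabular Markov decision processes. By appropriately designing the regularizer, the exploration bonus, and the learning rates, we achieve a more favorable polylog$(T)$ regret when the losses are stochastic, without weakening the worst-case guarantee in the adversarial regime; we accomplish this with Tsallis entropy and Shannon entropy regularizers. We also show that, under known transitions, a log-barrier regularizer gives a first-order regret bound in the adversarial regime.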
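
To make the role of the regularizer concrete, the following minimal Python sketch computes a regularized policy update over the actions of a single state in the follow-the-regularized-leader (FTRL) style, once with a negative Shannon entropy regularizer and once with a Tsallis entropy regularizer of order 1/2. It is an illustration under stated assumptions, not the algorithm analyzed in this work: the function names, the fixed learning rate `eta`, the particular scaling of the Tsallis regularizer, and the toy cumulative loss vector are all hypothetical, and the loss estimators, exploration bonus, and learning-rate schedule are omitted.

```python
import numpy as np


def shannon_ftrl_policy(cum_loss, eta):
    """Minimize <pi, cum_loss> + (1/eta) * sum_a pi(a) * ln pi(a) over the simplex.

    The minimizer has the closed form pi(a) proportional to exp(-eta * cum_loss[a]).
    """
    # Shifting by the minimum loss leaves the softmax unchanged but avoids overflow.
    weights = np.exp(-eta * (cum_loss - cum_loss.min()))
    return weights / weights.sum()


def tsallis_ftrl_policy(cum_loss, eta, iters=100):
    """Minimize <pi, cum_loss> - (2/eta) * sum_a sqrt(pi(a)) over the simplex.

    Stationarity gives pi(a) = 1 / (eta * (cum_loss[a] - x))**2 for a
    normalizer x below the smallest loss; x is found here by bisection.
    (The factor 2 in the regularizer is an assumed scaling for this sketch.)
    """
    k = len(cum_loss)
    lo = cum_loss.min() - np.sqrt(k) / eta  # here every pi(a) <= 1/k, so the sum is <= 1
    hi = cum_loss.min() - 1.0 / eta         # here the best action alone has mass 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = np.sum(1.0 / (eta * (cum_loss - mid)) ** 2)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    policy = 1.0 / (eta * (cum_loss - lo)) ** 2
    return policy / policy.sum()  # remove the residual bisection error


# Hypothetical cumulative loss estimates for three actions in one state.
cum_loss = np.array([1.0, 2.0, 4.0])
pi_shannon = shannon_ftrl_policy(cum_loss, eta=0.5)
pi_tsallis = tsallis_ftrl_policy(cum_loss, eta=0.5)
```

Both updates put more probability on actions with smaller cumulative loss. The Tsallis update decays polynomially rather than exponentially in the loss gap, so it keeps more mass on apparently suboptimal actions than the Shannon update does at the same `eta`.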