In average-reward reinforcement learning, the requirement for oracle knowledge
of the mixing time (a measure of how long a Markov chain under a fixed policy
needs to approach its stationary distribution) poses a significant challenge
for the global convergence of policy gradient methods.
This requirement is particularly problematic because the mixing time is
difficult and expensive to estimate in environments with large state spaces,
so effective gradient estimation requires impractically long trajectories in
practical applications. To address this limitation, we consider
the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level
Monte Carlo (MLMC) gradient estimator. Our approach effectively alleviates
the dependence on mixing-time knowledge, a first for global convergence in
average-reward MDPs. Furthermore, it exhibits the tightest available
dependence on the mixing time, $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$,
compared with prior work. In a 2D gridworld goal-reaching navigation
experiment, we demonstrate that MAC achieves higher reward than a previous
PG-based method for the average-reward setting, Parameterized Policy Gradient
with Advantage Estimation (PPGAE), especially when a relatively small
training sample budget restricts trajectory length.

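By introducing a multi-level policy gradient estimation method, we address the dependence on mixing-time knowledge in average-reward reinforcement learning and achieve higher reward than a previous policy-gradient-based method (PPGAE).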
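
To make the role of the MLMC gradient estimator concrete, the following is a minimal sketch of the standard MLMC construction, not the paper's implementation. The names `mlmc_gradient`, `sample_grads`, `t_max`, and `rng` are hypothetical: `sample_grads(n)` is assumed to roll out `n` consecutive transitions under the current policy and return one per-sample gradient vector for each, and `t_max` is an assumed cap on the rollout length.

```python
import random


def mean_vec(rows):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mlmc_gradient(sample_grads, t_max, rng):
    """One MLMC gradient estimate from a rollout of random length.

    sample_grads(n) is a hypothetical callback returning n per-sample
    gradient vectors; t_max is an assumed cap on the rollout length.
    """
    # Draw the level J with P(J = j) = 2^{-j} for j = 1, 2, ...
    level = 1
    while rng.random() < 0.5:
        level += 1
    n = 2 ** level

    if n > t_max:
        # Level exceeds the cap: fall back to the single-sample estimate.
        return mean_vec(sample_grads(1))

    grads = sample_grads(n)
    base = grads[0]                      # single-sample estimate
    fine = mean_vec(grads)               # average over 2^J samples
    coarse = mean_vec(grads[: n // 2])   # average over 2^(J-1) samples
    # Base estimate plus the importance-weighted difference of levels.
    return [b + n * (f - c) for b, f, c in zip(base, fine, coarse)]


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = lambda n: [[1.0 + rng.gauss(0.0, 0.1)] for _ in range(n)]
    print(mlmc_gradient(noisy, t_max=64, rng=rng))
```

Nothing in this sketch takes the mixing time as an input: the rollout length is drawn at random, and because level $j$ is chosen with probability $2^{-j}$ and costs $2^{j}$ samples, the expected rollout length grows only logarithmically in `t_max`.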