In recent years, considerable research activity has focused on
carrying out asymptotic and non-asymptotic convergence analyses for
two-timescale actor critic algorithms where the actor updates are performed on
a timescale that is slower than that of the critic. A recent work presented
the critic-actor algorithm for the infinite-horizon discounted cost setting in
the look-up table case, where the timescales of the actor and the critic are
reversed, together with an asymptotic convergence analysis.
In our work, we present the first critic-actor algorithm with function
approximation in the long-run average reward setting, along with the first
finite-time (non-asymptotic) analysis of such a scheme. We obtain optimal
learning rates and prove that our algorithm achieves a sample complexity of
$\tilde{\mathcal{O}}(\epsilon^{-2.08})$ for the mean squared error of the
critic to be upper bounded by $\epsilon$, which is better than the one obtained
for actor-critic in a similar setting. We also show the results of numerical
experiments on three benchmark settings and observe that the critic-actor
algorithm competes well with the actor-critic algorithm.
