Actor-critic (AC) is a powerful method for learning an optimal policy in
reinforcement learning: the critic evaluates the current policy, e.g., via
temporal difference (TD) learning with function approximation, and the actor
updates the policy along an approximate gradient direction using information
from the critic. This paper provides the \textit{tightest}
non-asymptotic convergence bounds for both the AC and natural AC (NAC)
algorithms. Specifically, existing studies show that AC converges to an
$\epsilon+\varepsilon_{\text{critic}}$ neighborhood of stationary points with
the best known sample complexity of $\mathcal{O}(\epsilon^{-2})$ (up to a log
factor), and NAC converges to an
$\epsilon+\varepsilon_{\text{critic}}+\sqrt{\varepsilon_{\text{actor}}}$
neighborhood of the global optimum with the best known sample complexity of
$\mathcal{O}(\epsilon^{-3})$, where $\varepsilon_{\text{critic}}$ is the
approximation error of the critic and $\varepsilon_{\text{actor}}$ is the
approximation error induced by the insufficient expressive power of the
parameterized policy class. This paper analyzes the convergence of both AC and
NAC algorithms with compatible function approximation. Our analysis eliminates
the term $\varepsilon_{\text{critic}}$ from the error bounds while still
achieving the best known sample complexities. Moreover, we focus on the
challenging single-loop setting with a single Markovian sample trajectory. Our
major technical novelty lies in analyzing the stochastic bias due to
policy-dependent and time-varying compatible function approximation in the
critic, and handling the non-ergodicity of the MDP due to the single Markovian
sample trajectory. Numerical results are also provided in the appendix.
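
To make the setting concrete, the following is a minimal sketch of a single-loop (natural) actor-critic with a compatible critic, run on one Markovian trajectory. It is not the algorithm analyzed in the paper: the two-state toy MDP, the tabular softmax policy, the state-value baseline `v`, the SARSA-style TD update, and all step sizes are assumptions chosen only for illustration. The point it shows is that the critic's features are the policy's score function, so they change every time the actor moves.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def score(theta, s, a):
    """Compatible feature psi(s, a) = grad_theta log pi(a|s), tabular softmax."""
    psi = [[0.0] * len(row) for row in theta]
    probs = softmax(theta[s])
    for b in range(len(probs)):
        psi[s][b] = (1.0 if b == a else 0.0) - probs[b]
    return psi

def dot(x, y):
    return sum(xi * yi for rx, ry in zip(x, y) for xi, yi in zip(rx, ry))

def single_loop_ac(natural=False, steps=5000, gamma=0.5,
                   alpha=0.02, beta=0.1, seed=0):
    """One critic step and one actor step per sample (hypothetical toy MDP)."""
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor parameters
    w = [[0.0] * n_actions for _ in range(n_states)]      # compatible critic
    v = [0.0] * n_states                                  # assumed baseline

    def sample_action(state):
        return rng.choices(range(n_actions), weights=softmax(theta[state]))[0]

    s = 0
    a = sample_action(s)
    for _ in range(steps):
        # Toy dynamics (assumption): action 1 pays 1, next state is uniform.
        r = 1.0 if a == 1 else 0.0
        s_next = rng.randrange(n_states)
        a_next = sample_action(s_next)

        # Features depend on the current policy, hence are time-varying.
        psi = score(theta, s, a)
        psi_next = score(theta, s_next, a_next)
        adv = dot(w, psi)  # critic's advantage estimate for (s, a)

        # Critic: one TD(0) step on Q(s, a) ~ v[s] + <w, psi(s, a)>.
        delta = r + gamma * (v[s_next] + dot(w, psi_next)) - (v[s] + adv)
        v[s] += beta * delta
        for i in range(n_states):
            for j in range(n_actions):
                w[i][j] += beta * delta * psi[i][j]

        # Actor: NAC steps along w; AC steps along adv * psi.
        for i in range(n_states):
            for j in range(n_actions):
                direction = w[i][j] if natural else adv * psi[i][j]
                theta[i][j] += alpha * direction

        s, a = s_next, a_next  # continue the same trajectory, no resets
    return [softmax(row) for row in theta]
```

Calling `single_loop_ac(natural=True)` returns the learned action probabilities per state. In this toy problem the better action is 1 in both states, so both variants should shift probability mass toward it.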

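This study provides the tightest non-asymptotic convergence bounds for the actor-critic (AC) and natural actor-critic (NAC) algorithms, analyzing their convergence under compatible function approximation.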

Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

This paper proposes GProp, a deep reinforcement learning algorithm for
continuous policies with compatible function approximation. The algorithm is
based on two innovations. Firstly, we present a temporal-difference based
method for learning the gradient of the value-function. Secondly, we present
the deviator-actor-critic (DAC) model, which comprises three neural networks
that respectively estimate the value function, estimate its gradient, and
determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual
bandit problem constructed from nonparametric regression datasets that is
designed to probe the ability of reinforcement learning algorithms to
accurately estimate gradients; and the octopus arm, a challenging reinforcement
learning benchmark. GProp is competitive with fully supervised methods on the
bandit task and achieves the best performance to date on the octopus arm.
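
The three-part structure can be sketched on a one-dimensional contextual bandit. This is only one plausible reading of the abstract, not the authors' implementation: scalar linear models stand in for the three neural networks, and the reward function, the Gaussian exploration noise, the feature choices, and all learning rates are assumptions made for illustration. The critic regresses the reward, the deviator regresses the leftover TD error on the action perturbation to estimate the value gradient, and the actor follows that estimated gradient.

```python
import random

def gprop_bandit_sketch(steps=20000, sigma=0.5, lr_actor=0.05,
                        lr_critic=0.1, lr_deviator=0.2, seed=0):
    """Deviator-actor-critic on a toy bandit (all modeling choices assumed)."""
    rng = random.Random(seed)
    theta = 0.0        # actor:    mu(s) = theta * s
    v0, v1 = 0.0, 0.0  # critic:   Q(s)  = v0 + v1 * s**2
    g = 0.0            # deviator: G(s)  = g * s, estimate of dQ/da at mu(s)

    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0)    # context
        eps = rng.gauss(0.0, sigma)   # exploration noise added to the action
        action = theta * s + eps

        # Hypothetical reward: the best action for context s is 2 * s.
        reward = -(action - 2.0 * s) ** 2

        # TD error; a bandit has no next state, so there is no bootstrap term.
        xi = reward - (v0 + v1 * s * s)

        # Deviator error: the part of xi not explained by <G(s), eps>.
        grad_est = g * s
        dev_err = xi - grad_est * eps

        # Actor follows the estimated value gradient: d mu / d theta = s.
        theta += lr_actor * grad_est * s
        # Critic: gradient step on the squared TD error.
        v0 += lr_critic * xi
        v1 += lr_critic * xi * s * s
        # Deviator: gradient step on the squared deviator error.
        g += lr_deviator * dev_err * eps * s

    return theta, g
```

Under these assumptions the optimal actor parameter is `theta = 2`, where the true action gradient vanishes, so `gprop_bandit_sketch()` should return `theta` close to 2 and `g` close to 0.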

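This work proposes GProp, a new deep reinforcement learning algorithm for training continuous-action policies. The algorithm is based on a temporal-difference method for learning the gradient of the value function, and it introduces the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. GProp is evaluated on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets, and the octopus arm. On the former, GProp is on par with fully supervised methods; on the latter, it achieves the best performance to date.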