Recent advances in reinforcement learning for partially observable Markov
decision processes (POMDPs) rely on the biologically implausible
backpropagation through time algorithm (BPTT) to perform gradient-descent
optimisation. In this paper we propose a novel reinforcement learning algorithm
that makes use of random feedback local online learning (RFLO), a biologically
plausible approximation of real-time recurrent learning (RTRL), to compute the
gradients of the parameters of a recurrent neural network in an online manner.
By combining it with TD($\lambda$), a variant of temporal-difference
reinforcement learning with eligibility traces, we create a biologically
plausible, recurrent actor-critic algorithm, capable of solving discrete and
continuous control tasks in POMDPs. We compare BPTT, RTRL and RFLO as well as
different network architectures, and find that RFLO can perform just as well as
RTRL while improving on even BPTT in terms of complexity. The proposed method,
called real-time recurrent reinforcement learning (RTRRL), serves as a model of
learning in biological neural networks, mimicking reward pathways in the
mammalian brain.
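
For readers unfamiliar with eligibility traces, the TD($\lambda$) component can be illustrated with a minimal sketch. It is not the RTRRL implementation: it assumes a linear critic $V(x) = w \cdot x$, for which the gradient of the value with respect to the weights is simply the feature vector $x$. In the recurrent setting described above, that gradient would instead be supplied by RFLO or RTRL. The function name `td_lambda_step` and the step size, discount and trace-decay values are illustrative assumptions.

```python
from typing import List, Tuple


def td_lambda_step(
    w: List[float],
    e: List[float],
    x: List[float],
    x_next: List[float],
    reward: float,
    done: bool,
    alpha: float = 0.1,   # step size (assumed value)
    gamma: float = 0.9,   # discount factor (assumed value)
    lam: float = 0.8,     # trace decay, the lambda in TD(lambda) (assumed value)
) -> Tuple[List[float], List[float], float]:
    """One online TD(lambda) update for a linear critic V(x) = w . x."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    # The value of a terminal successor state is defined to be zero.
    v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, x_next))
    # TD error: one-step bootstrapped target minus the current estimate.
    delta = reward + gamma * v_next - v
    # Accumulating trace: decay the old trace, then add the gradient of V
    # with respect to w, which is x for a linear critic.
    e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]
    # Every weight moves in proportion to the TD error and its own trace.
    w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
    return w, e, delta


if __name__ == "__main__":
    # Two-step episode over one-hot features for two states.
    w, e = [0.0, 0.0], [0.0, 0.0]
    w, e, delta = td_lambda_step(w, e, [1.0, 0.0], [0.0, 1.0], 1.0, False)
    w, e, delta = td_lambda_step(w, e, [0.0, 1.0], [0.0, 0.0], 1.0, True)
    print(w, e, delta)
```

The update uses only quantities available at the current time step, so no stored history of past activations is needed. The second step also adjusts the weight of the first state through its decayed trace, which is how credit reaches earlier states without unrolling the episode.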

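We propose a novel reinforcement learning algorithm called real-time recurrent reinforcement learning (RTRRL). It uses random feedback local online learning (RFLO), an approximation of real-time recurrent learning (RTRL), to compute the gradients of the parameters of a recurrent neural network, and combines it with temporal-difference reinforcement learning with eligibility traces (TD($\lambda$)). The algorithm solves discrete and continuous control tasks in partially observable Markov decision processes (POMDPs), is biologically plausible, and surpasses the conventional backpropagation through time algorithm (BPTT). The method learns by mimicking the biological neural networks of reward pathways in the mammalian brain.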