Multi-step learning applies lookahead over multiple time steps and has proved
valuable in policy evaluation settings. However, in the optimal control case,
the impact of multi-step learning has been relatively limited despite a number
of prior efforts. Fundamentally, this might be because multi-step policy
improvements require operations that cannot be approximated by stochastic
samples, hence hindering the widespread adoption of such methods in practice.
To address such limitations, we introduce doubly multi-step off-policy VI
(DoMo-VI), a novel oracle algorithm that combines multi-step policy
improvements and policy evaluations. DoMo-VI enjoys guaranteed convergence
speed-up to the optimal policy and is applicable in general off-policy learning
settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC),
a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a
bias-variance trade-off that ensures improved policy gradient estimates. When
combined with the IMPALA architecture, DoMo-AC has shown improvements over the
baseline algorithm on Atari-57 game benchmarks.
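
As a point of reference for the "lookahead over multiple time steps" mentioned above, the following is a minimal sketch of the standard n-step return used in multi-step policy evaluation. It is not DoMo-VI or DoMo-AC, whose update rules are not given in the abstract; the function name and arguments are hypothetical and chosen only for illustration.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Generic n-step return (illustrative; not the DoMo update).

    rewards:         the n rewards observed along the trajectory, in order
    bootstrap_value: value estimate of the state reached after n steps
    gamma:           discount factor in [0, 1]
    """
    # Work backwards: G = r_1 + gamma * (r_2 + gamma * (... + gamma * V))
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Three steps of reward 1.0, then bootstrap from a value estimate of 10.0.
    print(n_step_return([1.0, 1.0, 1.0], 10.0, 0.5))
```

A larger n relies more on sampled rewards and less on the bootstrap estimate, which is the usual source of the bias-variance trade-off in multi-step methods.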

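Introduces a new method, doubly multi-step off-policy VI (DoMo-VI), and its practical instantiation, doubly multi-step off-policy actor-critic (DoMo-AC). By combining multi-step policy improvement with multi-step policy evaluation, the approach speeds up convergence to the optimal policy and improves policy gradient estimates, and it outperforms the baseline algorithm on the Atari-57 game benchmark.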

DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

To estimate the value functions of policies from exploratory data, most
model-free off-policy algorithms rely on importance sampling, where the use of
importance sampling ratios often leads to estimates with severe variance. It is
thus desirable to learn off-policy without using the ratios. However, such an
algorithm does not exist for multi-step learning with function approximation.
In this paper, we introduce the first such algorithm based on
temporal-difference (TD) learning updates. We show that an explicit use of
importance sampling ratios can be eliminated by varying the amount of
bootstrapping in TD updates in an action-dependent manner. Our new algorithm
achieves stability using a two-timescale gradient-based TD update. A prior
algorithm based on lookup table representation called Tree Backup can also be
retrieved using action-dependent bootstrapping, becoming a special case of our
algorithm. In two challenging off-policy tasks, we demonstrate that our
algorithm is stable, effectively avoids the large variance issue, and can
perform substantially better than its state-of-the-art counterpart.
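
To make the Tree Backup special case mentioned above more concrete, here is a minimal tabular sketch of the standard n-step Tree Backup return. It is not the paper's algorithm, which uses function approximation and a two-timescale gradient-based TD update; the function name and argument layout are hypothetical. The point it illustrates is that only target-policy probabilities appear in the return, so no importance sampling ratio is needed.

```python
def tree_backup_return(rewards, next_q, next_pi, next_actions, gamma):
    """Tabular n-step Tree Backup return (illustrative sketch).

    rewards[k]:      reward received on step k of the trajectory
    next_q[k]:       list of Q(s', a) over all actions a, for the state s'
                     reached after step k
    next_pi[k]:      list of target-policy probabilities pi(a | s') for s'
    next_actions[k]: index of the action actually taken in s'
                     (the entry for the final step is ignored)
    gamma:           discount factor in [0, 1]
    """
    n = len(rewards)
    # Final step: bootstrap with the full expectation under the target policy.
    g = rewards[-1] + gamma * sum(
        p * q for p, q in zip(next_pi[-1], next_q[-1])
    )
    # Earlier steps: actions not taken contribute their Q estimates, while
    # the taken action passes on the deeper return, weighted by pi(a | s').
    for k in range(n - 2, -1, -1):
        a = next_actions[k]
        expected_other = sum(
            p * q
            for i, (p, q) in enumerate(zip(next_pi[k], next_q[k]))
            if i != a
        )
        g = rewards[k] + gamma * (expected_other + next_pi[k][a] * g)
    return g


if __name__ == "__main__":
    g = tree_backup_return(
        rewards=[1.0, 2.0],
        next_q=[[1.0, 3.0], [2.0, 4.0]],
        next_pi=[[0.5, 0.5], [0.25, 0.75]],
        next_actions=[1, 0],
        gamma=0.5,
    )
    print(g)
```

When the target policy assigns zero probability to the action that was taken, the deeper return receives zero weight and the estimate falls back to a one-step expected backup. In that sense the effective amount of bootstrapping depends on the action.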

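This paper proposes a multi-step off-policy learning algorithm, based on temporal-difference (TD) updates, that does not use importance sampling ratios. The ratios are eliminated by varying the amount of bootstrapping in the TD updates in an action-dependent manner. The algorithm achieves stability through a two-timescale gradient-based TD update and, on the tasks tested, performs better than its state-of-the-art counterpart.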