Emphatic algorithms are temporal-difference learning algorithms that change
their effective state distribution by selectively emphasizing and
de-emphasizing their updates on different time steps. Recent works by Sutton,
Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a
particular way, these algorithms become stable and convergent under off-policy
training with linear function approximation. This paper serves as a unified
summary of the available results from both works. In addition, we demonstrate
the empirical benefits from the flexibility of emphatic algorithms, including
state-dependent discounting, state-dependent bootstrapping, and the
user-specified allocation of function approximation resources.
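
The selective emphasis described above can be illustrated with a short sketch. It assumes the commonly cited ETD(lambda) update form with linear function approximation: a followon trace F, an emphasis M, and an eligibility trace e. The function name `etd_step` and all argument names are hypothetical and are not taken from either paper. Discount, bootstrapping, and interest values are passed in per step, which is one way the state-dependent variants could enter.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def etd_step(theta, e, F, phi, phi_next, reward, rho, rho_prev,
             gamma, gamma_next, lam, interest, alpha):
    """One emphatic TD(lambda) update (illustrative sketch, assumed form).

    theta      : weight vector (list of floats)
    e          : eligibility trace from the previous step
    F          : followon trace from the previous step
    phi        : feature vector of the current state
    phi_next   : feature vector of the next state
    rho        : importance-sampling ratio for the current action
    rho_prev   : importance-sampling ratio from the previous step
    gamma      : discount at the current state
    gamma_next : discount at the next state
    lam        : bootstrapping parameter at the current state
    interest   : user-specified interest in the current state
    """
    # Followon trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: how strongly the update at this time step is weighted.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by the emphasis.
    e = [rho * (gamma * lam * ei + M * pi) for ei, pi in zip(e, phi)]
    # Standard TD error under linear function approximation.
    delta = reward + gamma_next * dot(theta, phi_next) - dot(theta, phi)
    # Weight update along the emphasized trace.
    theta = [ti + alpha * delta * ei for ti, ei in zip(theta, e)]
    return theta, e, F
```

Setting `interest` to 1 in every state and `rho` to 1 corresponds to the on-policy case with uniform interest; raising `interest` for selected states is how a user could direct function approximation resources toward them.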

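This study summarizes two recent works on the stability and convergence of emphatic algorithms in reinforcement learning, and demonstrates the empirical benefits of their flexibility in state-dependent discounting, state-dependent bootstrapping, and the allocation of function approximation resources.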

Emphatic Temporal-Difference Learning

We introduce a generalization of temporal-difference (TD) learning to
networks of interrelated predictions. Rather than relating a single prediction
to itself at a later time, as in conventional TD methods, a TD network relates
each prediction in a set of predictions to other predictions in the set at a
later time. TD networks can represent and apply TD learning to a much wider
class of predictions than has previously been possible. Using a random-walk
example, we show that these networks can be used to learn to predict by a fixed
interval, which is not possible with conventional TD methods. Secondly, we show
that if the inter-predictive relationships are made conditional on action, then
the usual learning-efficiency advantage of TD methods over Monte Carlo
(supervised learning) methods becomes particularly pronounced. Thirdly, we
demonstrate that TD networks can learn predictive state representations that
enable exact solution of a non-Markov problem. A very broad range of
inter-predictive temporal relationships can be expressed in these networks.
Overall we argue that TD networks represent a substantial extension of the
abilities of TD methods and bring us closer to the goal of representing world
knowledge in entirely predictive, grounded terms.
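
The idea of relating each prediction to other predictions at a later time can be sketched as a simple chain. This is a minimal illustration under several assumptions: predictions are linear in a given feature vector, node 0 targets the next observation, and node i targets node i-1's prediction at the next step, so node i predicts the observation i+1 steps ahead. The function name `td_network_step` and the chain structure are hypothetical; the paper's networks are more general (for example, action-conditional links).

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def td_network_step(W, x, x_next, obs_next, alpha):
    """One update of a chain-structured TD network (illustrative sketch).

    W        : list of weight rows, one row per prediction node
    x        : feature vector at the current time step
    x_next   : feature vector at the next time step
    obs_next : scalar observation at the next time step
    alpha    : step size
    """
    # Current predictions and the predictions made at the next step.
    y = [dot(row, x) for row in W]
    y_next = [dot(row, x_next) for row in W]
    # Node 0 targets the next observation; node i targets node i-1 one step later.
    targets = [obs_next] + y_next[:-1]
    # Move each prediction toward its target along the current features.
    new_W = [
        [w + alpha * (z - yi) * xi for w, xi in zip(row, x)]
        for row, z, yi in zip(W, targets, y)
    ]
    return new_W, y
```

In this chain, the second node learns a two-step-ahead prediction only through the first node's prediction, which illustrates how a fixed prediction interval could be built from one-step relationships.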

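This work introduces a generalization of temporal-difference (TD) learning to networks of interrelated predictions. TD networks can represent and apply TD learning to a much wider class of predictions than was previously possible, and they improve learning efficiency when the relationships between predictions are made conditional on action. The work also demonstrates that TD networks can learn predictive state representations, making them a substantial extension of the abilities of TD methods and bringing us closer to the goal of representing world knowledge in entirely predictive, grounded terms.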

Temporal-Difference Networks

This paper presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and incremental, and its
per-time-step complexity scales linearly with the number of learned weights.
Previous work on actor-critic algorithms is limited to the on-policy setting
and does not take advantage of the recent advances in off-policy gradient
temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable
a target policy to be learned while following and obtaining data from another
(behavior) policy. For many problems, however, actor-critic methods are more
practical than action value methods (like Greedy-GQ) because they explicitly
represent the policy; consequently, the policy can be stochastic and utilize a
large action space. In this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning with the flexibility
in action selection given by actor-critic methods. We derive an incremental,
linear time and space complexity algorithm that includes eligibility traces,
prove convergence under assumptions similar to previous off-policy algorithms,
and empirically show better or comparable performance to existing algorithms on
standard reinforcement-learning benchmark problems.
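
The combination of an explicitly represented policy with off-policy data can be sketched in simplified form. This is not the paper's algorithm: eligibility traces are omitted, and the critic here is a plain importance-weighted TD(0) update rather than a gradient-TD critic, so the convergence results mentioned above do not carry over. The function names `softmax_probs` and `off_policy_ac_step`, the softmax policy over per-action features, and all argument names are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def softmax_probs(u, action_feats):
    """Target-policy probabilities from linear action preferences."""
    prefs = [dot(u, f) for f in action_feats]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [v / z for v in exps]


def off_policy_ac_step(u, v, x, x_next, action_feats, action, b_prob,
                       reward, gamma, alpha_u, alpha_v):
    """One simplified off-policy actor-critic update (illustrative sketch).

    u            : actor (policy) weights
    v            : critic (state-value) weights
    x, x_next    : state features at the current and next step
    action_feats : one feature vector per action in the current state
    action       : index of the action chosen by the behavior policy
    b_prob       : behavior-policy probability of that action
    """
    pi = softmax_probs(u, action_feats)
    # Importance-sampling ratio between target and behavior policies.
    rho = pi[action] / b_prob
    # TD error from the critic's current value estimates.
    delta = reward + gamma * dot(v, x_next) - dot(v, x)
    # Critic: importance-weighted TD(0) (stand-in for a gradient-TD critic).
    v = [vi + alpha_v * rho * delta * xi for vi, xi in zip(v, x)]
    # Actor: gradient of log pi for a softmax policy is
    # features(action) minus the policy-weighted mean feature vector.
    n = len(u)
    mean_feat = [sum(p * f[i] for p, f in zip(pi, action_feats)) for i in range(n)]
    grad_log = [action_feats[action][i] - mean_feat[i] for i in range(n)]
    u = [ui + alpha_u * rho * delta * g for ui, g in zip(u, grad_log)]
    return u, v
```

Both updates touch each weight once per step, which is in keeping with the linear per-time-step complexity described above.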

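This study proposes an online, incremental actor-critic algorithm for off-policy reinforcement learning. It combines off-policy learning and recent gradient temporal-difference techniques with the flexible action selection that an explicitly represented policy allows. The authors prove convergence and report performance better than or comparable to existing algorithms.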