We quantify the efficiency gain of temporal difference (TD) learning over the
direct, or Monte Carlo (MC), estimator for policy evaluation in reinforcement
learning, with an emphasis on estimation of quantities related to rare events.
Policy evaluation is complicated in the rare event setting by the long
timescale of the event and by the need for \emph{relative accuracy} in
estimates of very small values. Specifically, we focus on least-squares TD
(LSTD) prediction for finite-state Markov chains, and show that LSTD can
achieve relative accuracy far more efficiently than MC. We prove a central
limit theorem for the LSTD estimator and upper bound the \emph{relative
asymptotic variance} by simple quantities characterizing the connectivity of
states relative to the transition probabilities between them. Using this bound,
we show that, even when both the timescale of the rare event and the relative
error of the MC estimator are exponentially large in the number of states,
LSTD maintains a fixed level of relative accuracy with a total number of
observed transitions of the Markov chain that is only \emph{polynomially} large
in the number of states.
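
As a rough illustration of the kind of estimator discussed above, the following sketch applies LSTD with a tabular (one-hot) basis to a hitting probability, $u(x) = P(\text{reach } b \text{ before } a \mid \text{start at } x)$. It is not the construction analyzed in this work: the function name `lstd_hitting_probability`, the birth-death chain, the step probability `p_right`, and the use of independent one-step samples from each state are all assumptions made only for the example.

```python
import numpy as np

def lstd_hitting_probability(transitions, n_states, a, b):
    """Tabular LSTD estimate of u(x) = P(reach b before a | start at x).

    `transitions` is an iterable of observed (state, next_state) pairs.
    With a one-hot basis, LSTD reduces to solving the linear system
    sum_t [u(x_t) - u(x_{t+1})] = 0 for each interior state x_t,
    with boundary values u(a) = 0 and u(b) = 1.
    """
    interior = [s for s in range(n_states) if s not in (a, b)]
    idx = {s: i for i, s in enumerate(interior)}
    A = np.zeros((len(interior), len(interior)))
    rhs = np.zeros(len(interior))
    for x, y in transitions:
        if x not in idx:          # transitions out of a or b carry no information
            continue
        i = idx[x]
        A[i, i] += 1.0
        if y in idx:
            A[i, idx[y]] -= 1.0
        elif y == b:              # boundary value u(b) = 1 moves to the right-hand side
            rhs[i] += 1.0
    u = np.linalg.solve(A, rhs)   # requires every interior state to be observed
    return dict(zip(interior, u))

# Hypothetical example: a birth-death chain on {0, ..., 5} that steps right
# with probability p_right, so reaching state 5 before state 0 is unlikely.
rng = np.random.default_rng(0)
n_states, p_right = 6, 0.3
transitions = []
for x in range(1, n_states - 1):
    steps = rng.random(2000) < p_right
    transitions += [(x, x + 1) if s else (x, x - 1) for s in steps]

u_hat = lstd_hitting_probability(transitions, n_states, a=0, b=n_states - 1)

# Exact gambler's-ruin solution for comparison.
r = (1 - p_right) / p_right
u_exact = {x: (1 - r**x) / (1 - r**(n_states - 1)) for x in u_hat}
print(u_hat[1], u_exact[1])
```

The sketch only uses short transitions, none of which needs to witness the rare event itself, whereas a direct MC estimate of $u(1)$ would need full trajectories that reach $b$ before $a$.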
