At the working heart of policy iteration algorithms commonly used and studied
in the discounted setting of reinforcement learning, the policy evaluation step
estimates the value of states with samples from a Markov reward process induced
by following a Markov policy in a Markov decision process. We propose a simple
and efficient estimator called loop estimator that exploits the regenerative
structure of Markov reward processes without explicitly estimating a full
model. Our method enjoys a space complexity of $O(1)$ when estimating the value
of a single positive recurrent state $s$ unlike TD with $O(S)$ or model-based
methods with $O\left(S^2\right)$. Moreover, the regenerative structure enables
us to show, without relying on the generative model approach, that the
estimator has an instance-dependent convergence rate of
$\widetilde{O}\left(\sqrt{\tau_s/T}\right)$ over steps $T$ on a single sample
path, where $\tau_s$ is the maximal expected hitting time to state $s$. In
preliminary numerical experiments, the loop estimator outperforms model-free
methods, such as TD(k), and is competitive with the model-based estimator.

研究怎样使用所提出的 Loop estimator 算法优化 Policy iteration 算法中的 Policy evaluation 步骤，实现有效的、具有强大空间和收敛性的单状态 s 值计算，以精确地评估 MDP 中的状态价值。