Q-learning with function approximation can diverge in the off-policy setting, and the target network is a powerful technique for addressing this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$. Furthermore, we show that this sample complexity improves to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-5}\varepsilon^{-2})$ if all state-action pairs can be updated sequentially, and to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-4}\varepsilon^{-2})$ if, in addition, $\gamma \in (1/2, 1)$. Compared with vanilla Q-learning, our results show that introducing a periodically frozen target Q-function does not sacrifice sample complexity.
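
To make the setting concrete, the following is a minimal sketch of tabular target Q-learning with a generative oracle, in which every state-action pair is updated sequentially against a periodically frozen target. It is an illustration under stated assumptions, not the exact algorithm analyzed in the manuscript or in [Lee and He, 2020]: the function name `target_q_learning`, the array layout of `P` and `R`, the `1/t` step size, and the loop lengths `num_outer` and `num_inner` are all hypothetical choices made for this example.

```python
import numpy as np


def target_q_learning(P, R, gamma, num_outer, num_inner, seed=0):
    """Sketch of tabular target Q-learning with a generative oracle.

    P: array of shape (S, A, S), transition probabilities (assumed layout).
    R: array of shape (S, A), deterministic rewards (assumed layout).
    The 1/t step size is an illustrative choice, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    num_states, num_actions = R.shape
    q_target = np.zeros((num_states, num_actions))  # frozen target Q-function

    for _ in range(num_outer):
        q = q_target.copy()  # online estimate, restarted from the target
        for t in range(1, num_inner + 1):
            step = 1.0 / t
            # Sweep over all state-action pairs in a fixed order.
            for s in range(num_states):
                for a in range(num_actions):
                    # Generative oracle: draw one next state for (s, a).
                    s_next = rng.choice(num_states, p=P[s, a])
                    # Bootstrap from the frozen target, not from q itself.
                    td_target = R[s, a] + gamma * q_target[s_next].max()
                    q[s, a] += step * (td_target - q[s, a])
        q_target = q  # periodic target update
    return q_target
```

With the `1/t` step size, each inner loop averages `num_inner` sampled Bellman backups of the frozen target, so every outer iteration behaves like one step of approximate value iteration.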

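This paper studies the target network as a remedy for the divergence of Q-learning in the off-policy setting. It analyzes the sample complexity of target Q-learning in the tabular case with a generative oracle, finds that the bounds scale as $(1-\gamma)^{-5}$, or $(1-\gamma)^{-4}$ when all state-action pairs are updated sequentially and $\gamma \in (1/2, 1)$, and shows that introducing a periodically frozen target Q-function does not sacrifice sample complexity.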

A Note on Target Q-learning for Solving Finite MDPs with a Generative Oracle