Q-learning is a stochastic approximation version of classical value iteration. The literature has established that Q-learning suffers from both maximization bias and slow convergence. Recently, multi-step algorithms have shown practical advantages over existing methods. This paper proposes novel off-policy two-step Q-learning algorithms that do not require importance sampling. Under suitable assumptions, we show that the iterates of the proposed two-step Q-learning are bounded and converge almost surely to the optimal Q-values. This study also addresses the convergence analysis of a smooth version of two-step Q-learning, obtained by replacing the max function with the log-sum-exp function. The proposed algorithms are robust and easy to implement. Finally, we test the proposed algorithms on benchmark problems such as the roulette problem, the maximization bias problem, and randomly generated Markov decision processes, and compare them with existing methods from the literature. Numerical experiments demonstrate the superior performance of both two-step Q-learning and its smooth variant.
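
To make the smoothing step concrete, the following is a minimal sketch of the standard one-step tabular Q-learning update, with the bootstrap term computed either by the max function or by a log-sum-exp approximation. It is not the two-step algorithm proposed in the paper, whose update rule is not stated in this abstract. The function names, the temperature parameter `tau`, and the default step size and discount factor are assumptions made for illustration.

```python
import numpy as np

def smooth_max(q_values, tau=10.0):
    """Log-sum-exp approximation to max: (1/tau) * log(sum(exp(tau * q))).

    `tau` is an assumed temperature parameter; larger values track max more closely.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()  # subtract the maximum before exponentiating, for numerical stability
    return m + np.log(np.exp(tau * (q - m)).sum()) / tau

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, tau=None):
    """Standard one-step Q-learning update on a tabular Q array, in place.

    With tau=None the bootstrap term is the hard max over next-state actions;
    otherwise it is the log-sum-exp approximation above.
    """
    bootstrap = Q[s_next].max() if tau is None else smooth_max(Q[s_next], tau)
    Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
    return Q
```

For n actions, the log-sum-exp value lies between max(q) and max(q) + log(n)/tau, so the approximation tightens as `tau` grows.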

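This study proposes a new two-step off-policy Q-learning algorithm that does not use importance sampling and proves, under suitable assumptions, that the algorithm's iterates are bounded and converge almost surely to the optimal Q-values. It also analyzes the convergence of a smooth version of two-step Q-learning, obtained by replacing the max function with the log-sum-exp function. The algorithm is robust and easy to implement, and it is evaluated on benchmark problems such as the roulette problem, the maximization bias problem, and randomly generated Markov decision processes, where it is compared with existing methods from the literature. Numerical experiments demonstrate the superior performance of two-step Q-learning and its smooth variant.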

Two-step Q-learning