We propose a new Q-learning variant, called 2RA Q-learning, that addresses
some weaknesses of existing Q-learning methods in a principled manner. One such
weakness is an underlying estimation bias which cannot be controlled and often
results in poor performance. We propose a distributionally robust estimator for
the maximum expected value term, which allows us to precisely control the level
of estimation bias introduced. The distributionally robust estimator admits a
closed-form solution such that the proposed algorithm has a computational cost
per iteration comparable to Watkins' Q-learning. For the tabular case, we show
that 2RA Q-learning converges to the optimal policy and analyze its asymptotic
mean-squared error. Lastly, we conduct numerical experiments for various
settings, which corroborate our theoretical findings and indicate that 2RA
Q-learning often performs better than existing methods.
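
As an illustration of the estimation bias mentioned above, the following minimal sketch shows the maximum term in a tabular Watkins-style update and checks empirically that taking a maximum over noisy value estimates overestimates the true maximum. It does not implement the 2RA estimator, whose closed form is not given here; the function names, the Gaussian noise model, and all parameter values are assumptions chosen purely for illustration.

```python
import random


def watkins_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Watkins Q-learning step (illustrative helper).

    Q is a list of per-state lists of action values and is updated in place.
    """
    # The maximum over next-state action values is the term whose
    # estimation bias the abstract refers to.
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q


def mean_of_max(n_actions=10, n_samples=5, n_trials=2000, seed=0):
    """Average of max_a(sample mean) when every true action value is 0.

    Assumed noise model: i.i.d. standard Gaussian rewards per action.
    The true maximum is 0, so any positive result is overestimation bias.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        estimates = [
            sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
            for _ in range(n_actions)
        ]
        total += max(estimates)
    return total / n_trials


if __name__ == "__main__":
    Q = [[0.0, 0.0], [1.0, 2.0]]
    watkins_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
    print("updated Q[0][1]:", Q[0][1])
    print("mean of max over noisy estimates (true max is 0):", mean_of_max())
```

In this toy setting every action has the same true value, so the positive average returned by `mean_of_max` comes entirely from maximizing over noise. An estimator with a tunable bias level, as the abstract describes, would take the place of the `max` call in `watkins_update`.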
