Model-free reinforcement learning algorithms, such as Q-learning, perform
poorly in the early stages of learning in noisy environments, because much
effort is spent unlearning biased estimates of the state-action value function.
The bias results from selecting, among several noisy estimates, the apparent
optimum, which may actually be suboptimal. We propose G-learning, a new
off-policy learning algorithm that regularizes the value estimates by
penalizing deterministic policies at the beginning of the learning process. We
show that this method reduces the bias of the value-function estimation,
leading to faster convergence to the optimal value and the optimal policy.
Moreover, G-learning enables the natural incorporation of prior domain
knowledge, when available. The stochastic nature of G-learning also lets it
avoid some exploration costs, a property usually attributed only to on-policy
algorithms. We illustrate these ideas in several examples, where G-learning
yields significant improvements in both the convergence rate and the cost of
the learning process.
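
The selection bias described above can be reproduced in a few lines. The sketch below is an illustration only, not the G-learning update itself, which this abstract does not spell out. It assumes four actions with identical true value, independent Gaussian noise on each estimate, and a uniform prior over actions; the names `hard_max`, `soft_value`, `estimate_bias`, and the inverse-temperature parameter `beta` are hypothetical. It compares the hard maximum over noisy estimates with a softened log-mean-exp aggregate, the kind of aggregate that results when deterministic action selection is penalized.

```python
import math
import random


def hard_max(estimates):
    """Greedy aggregate: pick the apparent optimum among noisy estimates."""
    return max(estimates)


def soft_value(estimates, beta):
    """Softened aggregate (1/beta) * log(mean(exp(beta * x))).

    Assumes a uniform prior over actions. It approaches the plain mean as
    beta -> 0 and the hard max as beta -> infinity. Shifting by the max
    keeps the exponentials numerically stable.
    """
    m = max(estimates)
    mean_exp = sum(math.exp(beta * (x - m)) for x in estimates) / len(estimates)
    return m + math.log(mean_exp) / beta


def estimate_bias(aggregate, true_values, noise_std=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of E[aggregate(noisy values)] - max(true values)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, noise_std) for v in true_values]
        total += aggregate(noisy)
    return total / trials - max(true_values)


# Assumption: all four actions are equally good, so any positive bias is
# purely an artifact of selecting among noisy estimates.
true_values = [0.0, 0.0, 0.0, 0.0]
hard_bias = estimate_bias(hard_max, true_values)
soft_bias = estimate_bias(lambda est: soft_value(est, beta=0.5), true_values)

print(f"bias of hard max:       {hard_bias:.3f}")
print(f"bias of soft aggregate: {soft_bias:.3f}")
```

Under these assumptions the hard maximum overestimates the true optimum, while the softened aggregate is less biased. When the true values differ, a small `beta` would instead underestimate the optimum, which is consistent with applying the penalty mainly early in learning, when the estimates are noisiest.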

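Summary: This work proposes G-learning, a reinforcement learning algorithm that penalizes deterministic policies in order to reduce the bias of value-function estimates, yielding faster convergence and a lower learning cost in the early stages of learning.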