Despite overparameterization, deep networks trained via supervised learning
are easy to optimize and exhibit excellent generalization. One hypothesis to
explain this is that overparameterized deep networks enjoy the benefits of
implicit regularization induced by stochastic gradient descent, which favors
parsimonious solutions that generalize well on test inputs. It is reasonable to
surmise that deep reinforcement learning (RL) methods could also benefit from
this effect. In this paper, we discuss how the implicit regularization effect
of SGD seen in supervised learning could in fact be harmful in the offline deep
RL setting, leading to poor generalization and degenerate feature
representations. Our theoretical analysis shows that when existing models of
implicit regularization are applied to temporal difference learning, the
resulting regularizer favors degenerate solutions with excessive
"aliasing", in stark contrast to the supervised learning case. We back up these
findings empirically, showing that feature representations learned by a deep
network value function trained via bootstrapping can indeed become degenerate,
aliasing the representations for state-action pairs that appear on either side
of the Bellman backup. To address this issue, we derive the form of this
implicit regularizer and, inspired by this derivation, propose a simple and
effective explicit regularizer, called DR3, that counteracts the undesirable
effects of this implicit regularizer. When combined with existing offline RL
methods, DR3 substantially improves performance and stability, alleviating
unlearning in Atari 2600 games, D4RL domains and robotic manipulation from
images.
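
As a rough illustration of the idea behind such an explicit regularizer, the sketch below penalizes similarity between the features of the two state-action pairs on either side of a Bellman backup. The abstract does not state the regularizer's form, so everything here is an assumption: the penalty is taken to be the mean dot product between learned features of (s, a) and (s', a'), and the names `feature_similarity_penalty`, `regularized_loss`, and the coefficient `coeff` (with its default value) are hypothetical, chosen only for this example.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def feature_similarity_penalty(
    phi_sa: Sequence[Vector], phi_next: Sequence[Vector]
) -> float:
    """Mean dot product between features of (s, a) and (s', a') over a batch.

    Assumed form of the penalty: large values mean the two sides of the
    Bellman backup have similar ("aliased") features.
    """
    if not phi_sa or len(phi_sa) != len(phi_next):
        raise ValueError("batches must be non-empty and of equal size")
    return sum(dot(u, v) for u, v in zip(phi_sa, phi_next)) / len(phi_sa)


def regularized_loss(
    td_loss: float,
    phi_sa: Sequence[Vector],
    phi_next: Sequence[Vector],
    coeff: float = 0.03,  # hypothetical weight, not taken from the source
) -> float:
    """TD loss plus the weighted feature-similarity penalty."""
    return td_loss + coeff * feature_similarity_penalty(phi_sa, phi_next)
```

Under these assumptions, identical features on both sides of the backup yield the largest penalty for a given feature norm, while orthogonal features yield none. In a real training loop the same term would be computed on the penultimate-layer activations of the value network and differentiated along with the TD loss.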

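This study examines the role of implicit regularization in deep reinforcement learning. Our analysis shows that implicit regularization can degrade overall generalization and lead to degenerate feature representations. The paper addresses this problem by proposing a new regularization method, DR3, which achieves strong performance and stability on Atari 2600 games, D4RL domains, and robotic manipulation from images.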