We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studied domains, including Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance.
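
As an illustration of how a drop in feature rank could be monitored, the sketch below computes an effective rank from the singular values of a feature matrix (one row per state, one column per feature dimension). It assumes that rank is measured as the smallest number of singular values whose sum reaches a fixed share of the total. The function name `effective_rank`, the threshold `delta`, and the random matrices standing in for value-network features are all illustrative assumptions, not details given in the abstract.

```python
import numpy as np


def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """Smallest k such that the top-k singular values account for at
    least a (1 - delta) share of the sum of all singular values.

    `delta` is an assumed threshold chosen for illustration.
    """
    # Singular values are returned sorted in descending order.
    singular_values = np.linalg.svd(features, compute_uv=False)
    total = singular_values.sum()
    if total == 0.0:
        return 0  # an all-zero feature matrix has no effective rank
    cumulative_share = np.cumsum(singular_values) / total
    # First index whose cumulative share reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative_share, 1.0 - delta)) + 1
    # Guard against floating-point round-off pushing k past the end.
    return min(k, singular_values.size)


# Hypothetical stand-ins for features of a batch of 256 states, 64 dimensions.
rng = np.random.default_rng(0)
healthy_features = rng.standard_normal((256, 64))
collapsed_features = rng.standard_normal((256, 3)) @ rng.standard_normal((3, 64))

print("effective rank (healthy):  ", effective_rank(healthy_features))
print("effective rank (collapsed):", effective_rank(collapsed_features))
```

In practice one would evaluate such a measure on the penultimate-layer outputs of the value network at regular intervals during training; a steadily falling value would be one sign of the rank collapse described above.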

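Value-based deep reinforcement learning methods that approximate value functions with neural networks exhibit implicit under-parameterization: a drop in the rank of the learned value network features leads to a drop in performance, and controlling this rank collapse mitigates the phenomenon and improves performance.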

The data efficiency of deep reinforcement learning is inhibited by under-parameterization.