Exploration is a crucial and distinctive aspect of reinforcement learning (RL) that remains a fundamental open problem. Several methods have been proposed to tackle this challenge. Commonly used methods inject random noise directly into the actions, induce randomness indirectly via entropy maximization, or add intrinsic rewards that encourage the agent to steer toward novel regions of the state space. Another previously proposed idea is to use the Bellman error as a separate optimization objective for exploration. In this paper, we introduce three modifications that stabilize the latter and arrive at a deterministic exploration policy. Our separate exploration agent is informed about the state of the exploitation, which enables it to account for previous experiences. Further components make the exploration objective agnostic toward the episode length and mitigate the instability introduced by far-off-policy learning. Our experimental results show that our approach can outperform $\varepsilon$-greedy in dense and sparse reward settings.

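This work addresses the challenge of exploration in reinforcement learning and proposes a new architecture that stabilizes the optimization of the Bellman error in order to obtain a deterministic exploration policy. Our method not only uses previous experience to guide the exploration process, but also makes the exploration objective independent of the episode length, and thereby outperforms the conventional $\varepsilon$-greedy strategy in dense and sparse reward settings.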
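
To make the contrast between $\varepsilon$-greedy and Bellman-error-driven exploration concrete, the following is a minimal tabular sketch. It only illustrates the general idea of treating the absolute one-step Bellman error of the exploitation value function as the reward of a separate, deterministic exploration policy. It does not implement the three stabilizing modifications introduced in this paper, and all function names and the table layout (`q[state][action]`) are assumptions made for illustration.

```python
import random


def td_error(q, s, a, r, s_next, gamma=0.99, done=False):
    """One-step Bellman (TD) error of a tabular Q-function q[state][action]."""
    # At a terminal transition there is no bootstrapped value.
    target = r if done else r + gamma * max(q[s_next])
    return target - q[s][a]


def epsilon_greedy(q, s, epsilon, rng=random):
    """Baseline: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[s]))
    return max(range(len(q[s])), key=lambda a: q[s][a])


def exploration_reward(q, s, a, r, s_next, gamma=0.99, done=False):
    """Assumed exploration signal: magnitude of the exploitation Bellman error."""
    return abs(td_error(q, s, a, r, s_next, gamma, done))


def deterministic_explore(q_explore, s):
    """Deterministic exploration: act greedily on a separate exploration table."""
    return max(range(len(q_explore[s])), key=lambda a: q_explore[s][a])
```

In this sketch, `q_explore` would be trained on `exploration_reward` instead of the environment reward, so its greedy action points toward transitions that the exploitation Q-function currently predicts poorly, without injecting any action noise.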

Deterministic Exploration via Stationary Bellman Error Maximization