State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward or curiosity). Such methods can lead to excellent results on hard exploration tasks but can suffer from intrinsic reward bias and underperform when compared to an agent trained using only task rewards. This performance decrease occurs when an agent seeks out intrinsic rewards and performs unnecessary exploration even when sufficient task reward is available. This inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained policy optimization procedure that automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required. This results in superior exploration that does not require manual tuning to balance the intrinsic reward against the task reward. Consistent performance gains across sixty-one ATARI games validate our claim. The code is available at https://github.com/Improbable-AI/eipo.

该研究提出了一种名为EIPO的优化策略，通过自动调整内在奖励的重要性来平衡任务奖励和内在奖励的关系，以获得最佳探索结果。经过在61个ATARI游戏中的测试，表现优异。

通过受限制优化提升内在奖励