A promising technique for exploration is to maximize the entropy of the visited
state distribution, i.e., state entropy, by encouraging uniform coverage of
the visited state space. While it has been effective in an unsupervised setup, it
tends to struggle in a supervised setup with a task reward, where an agent
prefers to visit high-value states to exploit the task reward. Such a
preference can cause an imbalance between the distributions of high-value
and low-value states, which biases exploration towards low-value state
regions because state entropy increases as the distribution
becomes more uniform. This issue is exacerbated when high-value states are
narrowly distributed within the state space, making it difficult for the agent
to complete the tasks. In this paper, we present a novel exploration technique
that maximizes the value-conditional state entropy: it separately estimates
the state entropies conditioned on the value estimate of each state,
then maximizes their average. By only considering the visited states with
similar value estimates for computing the intrinsic bonus, our method prevents
the distribution of low-value states from affecting exploration around
high-value states, and vice versa. We demonstrate that the proposed alternative
to the state entropy baseline significantly accelerates various reinforcement
learning algorithms across a variety of tasks within MiniGrid, DeepMind Control
Suite, and Meta-World benchmarks. Source code is available at
this https URL.
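
The idea of computing the intrinsic bonus only from visited states with similar value estimates can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a k-nearest-neighbor style bonus, a max-norm over the joint (state, value) space, pre-normalized value estimates, and a log(distance + 1) reward shape. The function name `value_conditional_bonus` and the parameter `k` are hypothetical.

```python
import numpy as np


def value_conditional_bonus(states, values, k=3):
    """Per-sample kNN bonus in the joint (state, value) space (illustrative only).

    states: (N, D) array of visited states or state features, with N > k.
    values: (N,) array of value estimates, assumed to be normalized already.
    """
    states = np.asarray(states, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)

    # Pairwise Euclidean distances between states, shape (N, N).
    d_state = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Pairwise absolute differences between value estimates, shape (N, N).
    d_value = np.abs(values[:, None] - values[None, :])

    # Assumption: max-norm over the joint space, so a sample only counts as a
    # near neighbor if it is close in state AND has a similar value estimate.
    d_joint = np.maximum(d_state, d_value)

    # After sorting each row, column 0 is the sample itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(d_joint, axis=1)[:, k]

    # Larger distance to the k-th neighbor means a less-visited region.
    return np.log(kth + 1.0)
```

Under these assumptions, a state that is surrounded by many visited states with very different value estimates still receives a large bonus, so a dense cloud of low-value states does not suppress exploration around a rare high-value state.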

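This paper proposes an exploration technique based on value-conditional state entropy. The method separately estimates the state entropies conditioned on the value estimate of each state and maximizes their average. Because the intrinsic bonus is computed only from visited states with similar value estimates, the distribution of low-value states does not affect exploration around high-value states, which accelerates a range of RL algorithms across a variety of tasks.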

Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration

Recent unsupervised pre-training methods have been shown to be effective in
language and vision domains by learning useful representations for multiple
downstream tasks. In this paper, we investigate whether such unsupervised
pre-training methods can also be effective for vision-based reinforcement
learning (RL). To this end, we introduce a framework that learns
representations useful for understanding the dynamics via generative
pre-training on videos. Our framework consists of two phases: we pre-train an
action-free latent video prediction model, and then utilize the pre-trained
representations for efficiently learning action-conditional world models on
unseen environments. To incorporate additional action inputs during
fine-tuning, we introduce a new architecture that stacks an action-conditional
latent prediction model on top of the pre-trained action-free prediction model.
Moreover, for better exploration, we propose a video-based intrinsic bonus that
leverages pre-trained representations. We demonstrate that our framework
significantly improves both final performance and sample efficiency of
vision-based RL in a variety of manipulation and locomotion tasks. Code is
available at this https URL.
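
The stacking described above, an action-conditional latent prediction model placed on top of an action-free one, can be pictured with a toy NumPy sketch. Everything here is an assumption made for illustration: the class names, the dimensions, and the use of single random linear layers with `tanh` in place of learned latent video prediction models. The abstract does not say whether the pre-trained model is frozen or fine-tuned, so the sketch simply reuses it.

```python
import numpy as np

rng = np.random.default_rng(0)


class ActionFreeModel:
    """Stand-in for a pre-trained action-free latent predictor (hypothetical)."""

    def __init__(self, latent_dim):
        self.W = rng.normal(scale=0.1, size=(latent_dim, latent_dim))

    def step(self, z):
        # Predict the next action-free latent from the current one only.
        return np.tanh(z @ self.W)


class StackedActionConditionalModel:
    """Action-conditional layer stacked on top of the action-free model."""

    def __init__(self, base, latent_dim, action_dim):
        self.base = base  # pre-trained lower layer, reused here
        in_dim = latent_dim + latent_dim + action_dim
        self.W = rng.normal(scale=0.1, size=(in_dim, latent_dim))

    def step(self, z, s, a):
        # The lower layer never sees the action.
        z_next = self.base.step(z)
        # The upper layer combines its own state, the lower layer's
        # prediction, and the action input added for fine-tuning.
        upper_in = np.concatenate([s, z_next, a], axis=-1)
        s_next = np.tanh(upper_in @ self.W)
        return z_next, s_next
```

In this sketch the action enters only through the upper layer, so the lower layer's prediction is identical for any action, while the upper layer's state depends on it.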

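This paper introduces a framework that uses visual representations learned through generative pre-training to improve the performance and sample efficiency of vision-based reinforcement learning across a variety of tasks. An action-free latent video prediction model is pre-trained on video data, and its representations are then used to learn action-conditional world models in unseen environments. To incorporate action inputs during fine-tuning, a new architecture stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. For better exploration, the paper also proposes a video-based intrinsic bonus that leverages the pre-trained representations. Together these improve both sample efficiency and final performance.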