Extrinsic rewards can effectively guide reinforcement learning (RL) agents in
specific tasks. However, extrinsic rewards frequently fall short in complex
environments due to the significant human effort needed for their design and
annotation. This limitation underscores the necessity for intrinsic rewards,
which offer auxiliary and dense signals and can enable agents to learn in an
unsupervised manner. Although various intrinsic reward formulations have been
proposed, their implementation and optimization details are insufficiently
explored and lack standardization, thereby hindering research progress. To
address this gap, we introduce RLeXplore, a unified, highly modularized, and
plug-and-play framework offering reliable implementations of eight
state-of-the-art intrinsic reward algorithms. Furthermore, we conduct an
in-depth study that identifies critical implementation details and establishes
well-justified standard practices in intrinsically-motivated RL. The source
code for RLeXplore is available at this https URL

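In complex environments, extrinsic rewards often fall short because of the high human cost of designing and annotating them, which underscores the need for intrinsic rewards that provide auxiliary, dense signals and let agents learn without supervision. This work introduces RLeXplore, a unified, highly modularized, plug-and-play framework that offers reliable implementations of eight state-of-the-art intrinsic reward algorithms. Through an in-depth study, it also identifies critical implementation details and well-justified standard practices, filling a gap in the field.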

RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning

Offline reinforcement learning (RL) aims to learn an optimal policy from
pre-collected and labeled datasets, which eliminates the time-consuming data
collection in online RL. However, offline RL still bears a large burden of
specifying/handcrafting extrinsic rewards for each transition in the offline
data. As a remedy for the labor-intensive labeling, we propose to endow offline
RL tasks with a small amount of expert data and utilize this limited expert data
to drive intrinsic rewards, thus eliminating the need for extrinsic rewards. To achieve
that, we introduce Calibrated Latent gUidancE (CLUE), which utilizes a
conditional variational auto-encoder to learn a latent space such that
intrinsic rewards can be directly quantified over the latent space. CLUE's key
idea is to align the intrinsic rewards with the expert intention by
constraining the embeddings of expert data to a calibrated contextual
representation. We
instantiate the expert-driven intrinsic rewards in sparse-reward offline RL
tasks, offline imitation learning (IL) tasks, and unsupervised offline RL
tasks. Empirically, we find that CLUE can effectively improve the sparse-reward
offline RL performance, outperform the state-of-the-art offline IL baselines,
and discover diverse skills from static reward-free offline data.
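
The idea of scoring transitions in a latent space can be illustrated with a
minimal sketch. Everything specific here is an assumption made for
illustration and is not taken from the abstract: the `encode` function is a
hypothetical stand-in (a fixed random linear map) for a trained conditional
variational auto-encoder, the expert reference `z_expert` is taken to be the
mean expert embedding, and the exponential-of-squared-distance reward form and
the temperature `c` are one plausible choice rather than the authors'
confirmed formulation.

```python
import numpy as np


def encode(states, actions, weights):
    """Hypothetical stand-in for a trained CVAE encoder: a fixed linear map."""
    return np.concatenate([states, actions], axis=1) @ weights


def intrinsic_rewards(z, z_expert, c=1.0):
    """Assumed reward form: decays with squared distance to the expert embedding."""
    sq_dist = np.sum((z - z_expert) ** 2, axis=1)
    return np.exp(-c * sq_dist)


rng = np.random.default_rng(0)
state_dim, action_dim, latent_dim = 4, 2, 3
weights = rng.normal(size=(state_dim + action_dim, latent_dim))

# A handful of expert transitions define the reference point in latent space.
expert_s = rng.normal(size=(8, state_dim))
expert_a = rng.normal(size=(8, action_dim))
z_expert = encode(expert_s, expert_a, weights).mean(axis=0)

# Reward-free offline transitions are relabeled with intrinsic rewards.
offline_s = rng.normal(size=(100, state_dim))
offline_a = rng.normal(size=(100, action_dim))
rewards = intrinsic_rewards(encode(offline_s, offline_a, weights), z_expert)
```

Under these assumptions, transitions whose embeddings lie close to the expert
reference receive rewards near 1 and distant ones receive rewards near 0, so
the relabeled dataset could be passed to any offline RL algorithm in place of
handcrafted extrinsic rewards.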

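This paper proposes a method for extracting intrinsic rewards from expert data. The method uses Calibrated Latent Guidance (CLUE) to remove the need to manually specify extrinsic rewards in offline RL, and it achieves good results across different offline RL tasks.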