Recent advancements in deep reinforcement learning (RL) have demonstrated
notable progress in sample efficiency, spanning both model-based and model-free
paradigms. Despite the identification and mitigation of specific bottlenecks in
prior works, the agent's exploration ability remains under-emphasized in the
realm of sample-efficient RL. This paper investigates how to achieve
sample-efficient exploration in continuous control tasks. We introduce an RL
algorithm that incorporates a predictive model and off-policy learning
elements, where an online planner enhanced by a novelty-aware terminal value
function is employed for sample collection. Leveraging the forward predictive
error within a latent state space, we derive an intrinsic reward without
incurring parameter overhead. This reward establishes a solid connection to
model uncertainty, allowing the agent to effectively overcome the asymptotic
performance gap. Through extensive experiments, our method shows competitive or
even superior performance compared to prior works, especially in sparse-reward
cases.
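
The intrinsic reward described above can be illustrated with a minimal sketch. It assumes the reward is the squared error between the latent state the model predicts and the latent state actually reached; the abstract does not give the exact form. The names `intrinsic_reward`, `toy_encode`, and `toy_predict_next` are hypothetical, and the toy encoder and dynamics stand in for learned networks. Because the sketch reuses only the model's encoder and dynamics, it adds no parameters of its own.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def intrinsic_reward(
    encode: Callable[[Vector], Vector],
    predict_next: Callable[[Vector, Vector], Vector],
    obs: Vector,
    action: Vector,
    next_obs: Vector,
) -> float:
    """Squared forward-prediction error measured in latent space (assumed form)."""
    z_pred = predict_next(encode(obs), action)  # model's guess for the next latent
    z_next = encode(next_obs)                   # latent actually reached
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_next))


# Toy stand-ins (assumptions): identity encoder, additive latent dynamics.
def toy_encode(obs: Vector) -> Vector:
    return list(obs)


def toy_predict_next(z: Vector, action: Vector) -> Vector:
    return [zi + ai for zi, ai in zip(z, action)]


# A transition the model predicts exactly yields zero reward.
familiar = intrinsic_reward(
    toy_encode, toy_predict_next, [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
)
# A transition the model gets wrong yields a positive reward.
surprising = intrinsic_reward(
    toy_encode, toy_predict_next, [0.0, 0.0], [1.0, 0.0], [1.0, 2.0]
)
```

In this toy setting the mispredicted transition earns the larger reward, which is the behaviour that would steer a planner toward poorly modelled regions.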

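By combining a predictive model and off-policy learning elements with a novelty-aware terminal value function, this paper studies how to achieve sample-efficient exploration in continuous control tasks. Using the forward prediction error in a latent state space, we derive an intrinsic reward that introduces no additional parameters. The reward is strongly connected to model uncertainty, enabling the agent to effectively overcome the asymptotic performance gap. Extensive experiments show that our method achieves competitive or even superior performance compared with prior work, especially in sparse-reward settings.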

Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration

We propose the GFlowNets with Human Feedback (GFlowHF) framework to improve
the exploration ability when training AI models. For tasks where the reward is
unknown, we fit the reward function through human evaluations on different
trajectories. The goal of GFlowHF is to learn a policy that samples
trajectories with probability strictly proportional to their human ratings,
rather than focusing only on the highest-rated ones as RLHF does. Experiments
show that GFlowHF can achieve better exploration ability than RLHF.
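
The difference between the two objectives can be illustrated with a small sketch. It is not a GFlowNet and not an RLHF implementation: it only contrasts a target distribution proportional to ratings with a deliberately simplified reward-maximizing policy that puts all mass on the top-rated item. The ratings and the names `proportional_policy` and `greedy_policy` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict

# Hypothetical human ratings for three candidate trajectories.
ratings: Dict[str, float] = {"A": 6.0, "B": 3.0, "C": 1.0}


def proportional_policy(scores: Dict[str, float]) -> Dict[str, float]:
    """Sampling probabilities proportional to the ratings."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def greedy_policy(scores: Dict[str, float]) -> Dict[str, float]:
    """Simplified reward maximizer: all probability on the top-rated item."""
    best = max(scores, key=scores.get)
    return {name: 1.0 if name == best else 0.0 for name in scores}


p_prop = proportional_policy(ratings)
p_greedy = greedy_policy(ratings)

# Draw from the proportional policy with a seeded generator.
rng = random.Random(0)
samples = rng.choices(list(p_prop), weights=list(p_prop.values()), k=1000)
counts = Counter(samples)
```

Under the proportional target, lower-rated trajectories keep a nonzero sampling probability, whereas the greedy caricature never visits them. This is one way to read the exploration claim in the abstract.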

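This work proposes the GFlowNets with Human Feedback (GFlowHF) framework to improve exploration ability when training AI models. It fits the reward function from human evaluations of different trajectories, with the goal of learning a policy strictly proportional to human ratings, rather than focusing only on the ratings humans favor most as RLHF does. Experiments show that GFlowHF explores better than RLHF.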

GFlowNets with Human Feedback

Goal-conditioned hierarchical reinforcement learning (GCHRL) provides a
promising approach to solving long-horizon tasks. Recently, its success has
been extended to more general settings by concurrently learning hierarchical
policies and subgoal representations. Although GCHRL possesses superior
exploration ability by decomposing tasks via subgoals, existing GCHRL methods
struggle in temporally extended tasks with sparse external rewards, since the
high-level policy learning relies on external rewards. As the high-level policy
selects subgoals in an online learned representation space, the dynamic change
of the subgoal space severely hinders effective high-level exploration. In this
paper, we propose a novel regularization that contributes to both stable and
efficient subgoal representation learning. Building upon the stable
representation, we design measures of novelty and potential for subgoals, and
develop an active hierarchical exploration strategy that, without relying on
intrinsic rewards, seeks out new promising subgoals and states. Experimental results
show that our approach significantly outperforms state-of-the-art baselines in
continuous control tasks with sparse rewards.
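
One way to picture a novelty measure over subgoals is a visit count on a discretized representation space. The abstract does not define its novelty or potential measures, so everything below is an assumption made for illustration: the class `SubgoalNovelty`, the grid discretization, and the `1 / sqrt(1 + count)` score are hypothetical choices, and the potential measure is not modelled at all.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple

Point = Sequence[float]
Cell = Tuple[int, ...]


class SubgoalNovelty:
    """Count-based novelty over grid cells of a subgoal space (illustrative)."""

    def __init__(self, cell_size: float = 1.0) -> None:
        self.cell_size = cell_size
        self.counts: Counter = Counter()

    def _cell(self, point: Point) -> Cell:
        # Map a continuous point to the index of the grid cell containing it.
        return tuple(math.floor(x / self.cell_size) for x in point)

    def visit(self, point: Point) -> None:
        self.counts[self._cell(point)] += 1

    def novelty(self, point: Point) -> float:
        # Unvisited cells score 1.0; the score decays as visits accumulate.
        return 1.0 / math.sqrt(1 + self.counts[self._cell(point)])

    def most_novel(self, candidates: Iterable[Point]) -> Point:
        return max(candidates, key=self.novelty)


tracker = SubgoalNovelty(cell_size=1.0)
for visited in [(0.2, 0.3), (0.4, 0.9), (0.7, 0.1)]:
    tracker.visit(visited)

near = (0.5, 0.5)  # falls in the cell visited three times
far = (3.2, 1.0)   # falls in an unvisited cell
chosen = tracker.most_novel([near, far])
```

Selecting the candidate with the highest score sends the high-level policy toward less-visited regions without adding a bonus to the reward. A count like this is only meaningful if the representation stays stable, which is the role the abstract assigns to its regularization.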

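This paper proposes a novel regularization that makes subgoal representation learning more stable and efficient, and designs an active hierarchical exploration strategy that seeks out new, promising subgoals and states without intrinsic rewards. Experimental results show that the approach significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards.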