The inverse reinforcement learning approach to imitation learning is a
double-edged sword. On the one hand, it can enable learning from a smaller
number of expert demonstrations with more robustness to error compounding than
behavioral cloning approaches. On the other hand, it requires that the learner
repeatedly solve a computationally expensive reinforcement learning (RL)
problem. Often, much of this computation is wasted searching over policies very
dissimilar to the expert's. In this work, we propose using hybrid RL --
training on a mixture of online and expert data -- to curtail unnecessary
exploration. Intuitively, the expert data focuses the learner on good states
during training, which reduces the amount of exploration required to compute a
strong policy. Notably, such an approach does not need the ability to reset the
learner to arbitrary states in the environment, a requirement of prior work in
efficient inverse RL. More formally, we derive a reduction from inverse RL to
expert-competitive RL (rather than globally optimal RL) that allows us to
dramatically reduce interaction during the inner policy search loop while
maintaining the benefits of the IRL approach. This allows us to derive both
model-free and model-based hybrid inverse RL algorithms with strong policy
performance guarantees. Empirically, we find that our approaches are
significantly more sample efficient than standard inverse RL and several other
baselines on a suite of continuous control tasks.
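The idea of training on a mixture of online and expert data can be sketched as a batch sampler. This is a minimal illustration, not the authors' implementation: the function name `sample_hybrid_batch`, the 50/50 default split, and the fallback to expert-only batches before any online data exists are all assumptions made for the example.

```python
import random


def sample_hybrid_batch(online_buffer, expert_buffer, batch_size,
                        expert_fraction=0.5, rng=None):
    """Draw a training batch that mixes expert and online transitions.

    expert_fraction is a hypothetical knob; the abstract does not state
    which mixing ratio is used.
    """
    rng = rng or random.Random()
    n_expert = int(round(batch_size * expert_fraction))
    if not online_buffer:
        # Assumption: before any online data is collected, use expert data only.
        n_expert = batch_size
    n_online = batch_size - n_expert
    # Sample with replacement from each source, then shuffle the combined batch.
    batch = [rng.choice(expert_buffer) for _ in range(n_expert)]
    batch += [rng.choice(online_buffer) for _ in range(n_online)]
    rng.shuffle(batch)
    return batch
```

Because part of every batch comes from expert transitions, the policy search step sees expert-visited states throughout training, which is the intuition the abstract gives for reduced exploration.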

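We propose using hybrid reinforcement learning to reduce unnecessary exploration in inverse reinforcement learning. Expert data guides the learner during training, which reduces the interaction needed in the inner policy search loop while still yielding strong policy performance.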

Hybrid Inverse Reinforcement Learning

The theories of offline and online reinforcement learning, despite having
evolved in parallel, have begun to show signs of a possible
unification, with algorithms and analysis techniques for one setting often
having natural counterparts in the other. However, the notion of density ratio
modeling, an emerging paradigm in offline RL, has been largely absent from
online RL, perhaps for good reason: the very existence and boundedness of
density ratios relies on access to an exploratory dataset with good coverage,
but the core challenge in online RL is to collect such a dataset without having
one to start. In this work we show -- perhaps surprisingly -- that density
ratio-based algorithms have online counterparts. Assuming only the existence of
an exploratory distribution with good coverage, a structural condition known as
coverability (Xie et al., 2023), we give a new algorithm (GLOW) that uses
density ratio realizability and value function realizability to perform
sample-efficient online exploration. GLOW addresses unbounded density ratios
via careful use of truncation, and combines this with optimism to guide
exploration. GLOW is computationally inefficient; we complement it with a more
efficient counterpart, HyGLOW, for the Hybrid RL setting (Song et al., 2022)
wherein online RL is augmented with additional offline data. HyGLOW is derived
as a special case of a more general meta-algorithm that provides a provable
black-box reduction from hybrid RL to offline RL, which may be of independent
interest.
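The role of truncation can be illustrated with a clipped density ratio. This is a generic sketch of capping an importance weight, not GLOW's actual estimator: the function names, the `cap` parameter, and the handling of zero data probability are assumptions made for the example.

```python
def truncated_ratio(p_target, p_data, cap):
    """Density ratio p_target / p_data, clipped at `cap` so it stays bounded."""
    if p_data <= 0.0:
        # Assumption: an undefined ratio is treated as the cap when the target
        # puts mass here, and as zero otherwise.
        return cap if p_target > 0.0 else 0.0
    return min(p_target / p_data, cap)


def truncated_weighted_mean(values, p_targets, p_datas, cap):
    """Average of values reweighted by truncated density ratios."""
    weights = [truncated_ratio(pt, pd, cap) for pt, pd in zip(p_targets, p_datas)]
    return sum(w * v for w, v in zip(weights, values)) / len(values)
```

Clipping trades some bias for a bounded weight: a sample from a poorly covered region can contribute at most `cap` times its value, however small the data probability is.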

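This paper observes that the theories of offline and online reinforcement learning are beginning to unify, shows that density ratio modeling has a counterpart in online reinforcement learning, and proposes the GLOW algorithm for online exploration together with HyGLOW, a more efficient counterpart for the hybrid RL setting.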

Harnessing Density Ratios for Online Reinforcement Learning

Hybrid RL is the setting where an RL agent has access to both offline data
and online data by interacting with the real-world environment. In this work,
we propose a new hybrid RL algorithm that combines an on-policy actor-critic
method with offline data. On-policy methods such as policy gradient and natural
policy gradient (NPG) have been shown to be more robust to model misspecification,
though they may sometimes be less sample efficient than methods that rely on
off-policy learning. On the other hand, offline methods that depend on
off-policy training often require strong assumptions in theory and are less
stable to train in practice. Our new approach integrates a procedure of
off-policy training on the offline data into an on-policy NPG framework. We
show that our approach, in theory, can obtain a best-of-both-worlds type of
result -- it achieves the state-of-the-art theoretical guarantees of offline RL
when offline RL-specific assumptions hold, while at the same time maintaining
the theoretical guarantees of on-policy NPG regardless of the offline RL
assumptions' validity. Experimentally, in challenging rich-observation
environments, we show that our approach outperforms a state-of-the-art hybrid
RL baseline which only relies on off-policy policy optimization, demonstrating
the empirical benefit of combining on-policy and off-policy learning. Our code
is publicly available at this https URL
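For intuition about the on-policy component, the sketch below shows the standard NPG step for a tabular softmax policy at a single state, where the update reduces to a multiplicative-weights rule. It is not the paper's algorithm: the advantage estimates are assumed to come from some critic, and how that critic is fit using offline data is not shown. The name `npg_softmax_update` and the `step_size` parameter are chosen for the example.

```python
import math


def npg_softmax_update(probs, advantages, step_size):
    """One NPG step for a tabular softmax policy at a single state.

    New probabilities are proportional to probs[a] * exp(step_size * advantages[a]).
    Constant factors such as 1 / (1 - gamma) are assumed folded into step_size.
    """
    weights = [p * math.exp(step_size * a) for p, a in zip(probs, advantages)]
    total = sum(weights)
    return [w / total for w in weights]
```

Actions with higher estimated advantage gain probability mass at a rate set by `step_size`, so the quality of the critic directly shapes each policy update.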

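Hybrid reinforcement learning is the setting where an RL agent has access to both offline data and online data gathered by interacting with the real environment. This paper proposes a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. In theory, the approach attains the state-of-the-art guarantees of offline RL when the offline RL-specific assumptions hold, while retaining the theoretical guarantees of the on-policy method regardless of whether those assumptions are valid. Experimentally, in challenging rich-observation environments, the approach outperforms a state-of-the-art hybrid RL baseline that relies only on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning.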

Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees

Mobile Manipulation (MM) systems are ideal candidates for taking up the role
of a personal assistant in unstructured real-world environments. Among other
challenges, MM requires effective coordination of the robot's embodiments for
executing tasks that require both mobility and manipulation. Reinforcement
Learning (RL) holds the promise of endowing robots with adaptive behaviors, but
most methods require prohibitively large amounts of data for learning a useful
control policy. In this work, we study the integration of robotic reachability
priors in actor-critic RL methods for accelerating the learning of MM for
reaching and fetching tasks. Namely, we consider the problem of optimal base
placement and the subsequent decision of whether to activate the arm for
reaching a 6D target. For this, we devise a novel Hybrid RL method that handles
discrete and continuous actions jointly, resorting to the Gumbel-Softmax
reparameterization. Next, we train a reachability prior using data from the
operational robot workspace, inspired by classical methods. Subsequently, we
derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by
modeling them as a sum of residual approximators. Every time a new task needs
to be learned, we can transfer our learned residuals and learn the component of
the Q-function that is task-specific, hence, maintaining the task structure
from prior behaviors. Moreover, we find that regularizing the target policy
with a prior policy yields more expressive behaviors. We evaluate our method in
simulation in reaching and fetching tasks of increasing difficulty, and we show
the superior performance of BHyRL against baseline methods. Finally, we transfer
the 6D fetching policy learned with BHyRL zero-shot to our MM robot
TIAGo++. For more details and code release, please refer to our project site:
this http URL
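Two ingredients named in the abstract can be sketched in a few lines: a Gumbel-Softmax relaxed sample for the discrete decision (for example, whether to activate the arm), and a Q-function expressed as a sum of residual approximators. This is an illustrative sketch only, not the BHyRL implementation: the function names are hypothetical, and representing each residual as a plain Python callable is an assumption.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)        # keep log(u) finite
        gumbel = -math.log(-math.log(u))    # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def boosted_q(residuals, state, action):
    """Q-value modeled as the sum of residual approximators.

    Assumption: residuals from earlier tasks are reused as-is and only the
    newest one is trained on the current task.
    """
    return sum(residual(state, action) for residual in residuals)
```

Lower temperatures push the relaxed sample toward a one-hot vector. In an automatic differentiation framework the same expression stays differentiable with respect to the logits; this plain-Python version only shows the sampling step.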

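This paper proposes a hybrid reinforcement learning algorithm combined with a robotic reachability prior, which accelerates learning for mobile manipulation systems and improves robot performance on tasks in real-world environments.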