In this paper, we investigate an offline reinforcement learning (RL) problem
where datasets are collected from two domains. In this scenario, having
datasets with domain labels facilitates efficient policy training. However, in
practice, the task of assigning domain labels can be resource-intensive or
infeasible at a large scale, leading to a prevalence of domain-unlabeled data.
To formalize this challenge, we introduce a novel offline RL problem setting
named Positive-Unlabeled Offline RL (PUORL), which incorporates
domain-unlabeled data. To address PUORL, we develop an offline RL algorithm
utilizing positive-unlabeled learning to predict the domain labels of
domain-unlabeled data, enabling the integration of this data into policy
training. Our experiments show the effectiveness of our method in accurately
identifying domains and learning policies that outperform baselines in the
PUORL setting, highlighting its capability to leverage domain-unlabeled data
effectively.
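
As a minimal sketch of the positive-unlabeled step described above, the snippet below computes a non-negative PU risk for a domain classifier. It is an assumption that this particular estimator is used; the abstract does not name one. The names `nnpu_risk`, `pos_scores`, `unl_scores` and `prior` are hypothetical, the scores stand for real-valued classifier outputs on domain-labeled (positive) and domain-unlabeled transitions, and `prior` is an assumed known fraction of positives among the unlabeled data.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid loss l(z, y) = 1 / (1 + exp(y * z)); small when sign(z) == y."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nnpu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk (illustrative; not necessarily the paper's choice).

    pos_scores: classifier scores on domain-labeled (positive) samples.
    unl_scores: classifier scores on domain-unlabeled samples.
    prior:      assumed fraction of positives among the unlabeled samples.
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # The negative-class risk is estimated from unlabeled data after
    # subtracting the positive contribution; it is clipped at zero.
    negative_part = unl_as_neg - prior * pos_as_neg
    return prior * pos_as_pos + max(0.0, negative_part)


# Hypothetical scores: one of the four unlabeled samples looks positive.
risk = nnpu_risk([2.0, 3.0], [2.5, -2.0, -3.0, -2.5], prior=0.25)
```

A classifier trained to minimize such a risk could then assign predicted domain labels to the unlabeled transitions before policy training.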

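By developing an offline RL algorithm that incorporates positive-unlabeled learning, this paper addresses the challenge of domain-unlabeled data: it accurately identifies domains and learns policies that outperform baselines, making effective use of the domain-unlabeled data.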

Offline Reinforcement Learning with Domain-Unlabeled Data across Two Domains

Leveraging Domain-Unlabeled Data in Offline Reinforcement Learning across Two Domains

Inverse Reinforcement Learning (IRL) is a powerful framework for learning
complex behaviors from expert demonstrations. However, it traditionally
requires repeatedly solving a computationally expensive reinforcement learning
(RL) problem in its inner loop. It is desirable to reduce the exploration
burden by leveraging expert demonstrations in the inner-loop RL. As an example,
recent work resets the learner to expert states in order to inform the learner
of high-reward expert states. However, such an approach is infeasible in the
real world. In this work, we consider an alternative approach to speeding up
the RL subroutine in IRL: pessimism, i.e., staying close to the expert's
data distribution, instantiated via the use of offline RL algorithms. We
formalize a connection between offline RL and IRL, enabling us to use an
arbitrary offline RL algorithm to improve the sample efficiency of IRL. We
validate our theory experimentally by demonstrating a strong correlation
between the efficacy of an offline RL algorithm and how well it works as part
of an IRL procedure. By using a strong offline RL algorithm as part of an IRL
procedure, we are able to find policies that match expert performance
significantly more efficiently than the prior art.

By using an offline RL algorithm as part of the IRL procedure, we are able to find policies that match expert performance more efficiently.

The Merits of Pessimism in Inverse Reinforcement Learning

The Virtues of Pessimism in Inverse Reinforcement Learning

Controlling agents remotely with deep reinforcement learning (DRL) in the
real world has yet to be realized. One crucial stepping stone is to devise RL
algorithms that are robust in the face of dropped information from corrupted
communication or malfunctioning sensors. Typical RL methods require
considerable online interaction data that are costly and unsafe to collect in
the real world. Furthermore, when applied to frame dropping scenarios,
they perform unsatisfactorily even at moderate drop rates. To address these
issues, we propose Decision Transformer under Random Frame Dropping (DeFog), an
offline RL algorithm that enables agents to act robustly in frame dropping
scenarios without online interaction. DeFog first randomly masks out data in
the offline datasets and explicitly adds the time span of frame dropping as
inputs. A subsequent finetuning stage on the same offline dataset with a
higher mask rate further boosts performance. Empirical results show
that DeFog outperforms strong baselines under severe frame drop rates such as
90%, while maintaining similar returns under non-frame-dropping conditions in
the regular MuJoCo control benchmarks and the Atari environments. Our approach
offers a robust and deployable solution for controlling agents in real-world
environments with limited or unreliable data.
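
The masking step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `drop_frames` is hypothetical, the first frame is assumed to always arrive, and a dropped frame is assumed to be replaced by the most recently received one. The second return value is the time span of frame dropping that the abstract says is added as an input.

```python
import random


def drop_frames(observations, drop_rate, seed=0):
    """Simulate random frame dropping on one offline trajectory.

    Returns (seen, drop_span):
      seen[t]      -- the observation available at step t; while frames are
                      dropped, the last received observation is repeated.
      drop_span[t] -- number of steps since a frame was last received.
    """
    rng = random.Random(seed)
    seen, drop_span = [], []
    for t, obs in enumerate(observations):
        # Assumption: the first frame is never dropped.
        dropped = t > 0 and rng.random() < drop_rate
        if dropped:
            seen.append(seen[-1])
            drop_span.append(drop_span[-1] + 1)
        else:
            seen.append(obs)
            drop_span.append(0)
    return seen, drop_span


# Hypothetical trajectory of five scalar observations.
seen, drop_span = drop_frames([0.1, 0.2, 0.3, 0.4, 0.5], drop_rate=0.5)
```

Under this sketch, the finetuning stage would correspond to calling the same function on the same trajectories with a larger `drop_rate`.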

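This paper proposes Decision Transformer under Random Frame Dropping, an offline RL algorithm that enables agents to act robustly in frame dropping scenarios without online interaction data. The algorithm randomly masks out data in the offline datasets, explicitly adds the time span of frame dropping as input, and is finetuned on the same offline dataset; it outperforms strong baselines under severe frame drop rates while achieving similar returns in the regular MuJoCo control benchmarks and the Atari environments. The approach offers a robust and deployable solution for controlling agents in real-world environments with limited or unreliable data.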

Decision Transformer under Random Frame Dropping

Decision Transformer under Random Frame Dropping

Many reinforcement learning (RL) problems in practice are offline, learning
purely from observational data. A key challenge is how to ensure the learned
policy is safe, which requires quantifying the risk associated with different
actions. In the online setting, distributional RL algorithms do so by learning
the distribution over returns (i.e., cumulative rewards) instead of the
expected return; beyond quantifying risk, they have also been shown to learn
better representations for planning. We propose Conservative Offline
Distributional Actor Critic (CODAC), an offline RL algorithm suitable for both
risk-neutral and risk-averse domains. CODAC adapts distributional RL to the
offline setting by penalizing the predicted quantiles of the return for
out-of-distribution actions. We prove that CODAC learns a conservative return
distribution: in particular, for finite MDPs, CODAC converges to a uniform
lower bound on the quantiles of the return distribution; our proof relies on a
novel analysis of the distributional Bellman operator. In our experiments, on
two challenging robot navigation tasks, CODAC successfully learns risk-averse
policies using offline data collected purely from risk-neutral agents.
Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of
both expected and risk-sensitive performance.
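
To illustrate why learning return quantiles rather than only the expected return supports risk-averse decisions, the sketch below compares two actions by their mean return and by conditional value at risk (CVaR). Using CVaR as the risk measure is an assumption here, and the function names and quantile values are made up for illustration; they are not taken from the paper.

```python
def expected_return(quantiles):
    """Risk-neutral value: mean of equally weighted quantile estimates."""
    return sum(quantiles) / len(quantiles)


def cvar(quantiles, alpha):
    """Risk-averse value: mean of the worst alpha-fraction of the quantiles."""
    ordered = sorted(quantiles)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Hypothetical quantile estimates of the return for two actions.
safe_action = [4.0, 5.0, 5.0, 6.0]
risky_action = [-10.0, 5.0, 10.0, 20.0]

# A risk-neutral agent ranks actions by expected return,
# a risk-averse agent by CVaR at level alpha.
neutral_choice = max([safe_action, risky_action], key=expected_return)
averse_choice = max([safe_action, risky_action], key=lambda q: cvar(q, 0.25))
```

With these made-up values the two criteria disagree: the risky action has the higher mean, while the safe action has the higher CVaR at level 0.25.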

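This paper proposes CODAC, an offline RL algorithm suitable for both risk-neutral and risk-averse domains. CODAC adapts distributional RL to the offline setting by penalizing the predicted quantiles of the return for out-of-distribution actions, is proven to learn a conservative return distribution, successfully learns risk-averse policies on robot navigation tasks, and outperforms prior methods on the D4RL MuJoCo benchmark.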