While imitation learning requires access to high-quality data, offline
reinforcement learning (RL) should, in principle, perform similarly or better
with substantially lower data quality by using a value function. However,
current results indicate that offline RL often performs worse than imitation
learning, and it is often unclear what holds back the performance of offline
RL. Motivated by this observation, we aim to understand the bottlenecks in
current offline RL algorithms. While poor performance of offline RL is
typically attributed to an imperfect value function, we ask: is the main
bottleneck of offline RL indeed in learning the value function, or something
else? To answer this question, we perform a systematic empirical study of (1)
value learning, (2) policy extraction, and (3) policy generalization in offline
RL problems, analyzing how these components affect performance. We make two
surprising observations. First, we find that the choice of a policy extraction
algorithm significantly affects the performance and scalability of offline RL,
often more so than the value learning objective. For instance, we show that
common value-weighted behavioral cloning objectives (e.g., AWR) do not fully
leverage the learned value function, and switching to behavior-constrained
policy gradient objectives (e.g., DDPG+BC) often leads to substantial
improvements in performance and scalability. Second, we find that a major barrier
to improving offline RL performance is often imperfect policy generalization to
test-time states outside the support of the training data, rather than policy
learning on in-distribution states. We then show that the use of suboptimal but
high-coverage data or test-time policy training techniques can address this
generalization issue in practice. Specifically, we propose two simple test-time
policy improvement methods and show that these methods lead to better
performance.
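
To make the contrast between the two policy extraction families concrete, here is a minimal NumPy sketch of the loss values they compute on a batch. It assumes the commonly written forms: AWR weights the log-likelihood of dataset actions by exp(alpha * (Q - V)), and DDPG+BC maximizes Q at the policy's own action plus an alpha-weighted behavioral cloning term. The function names, the `alpha` parameter, and the `max_weight` clip are illustrative assumptions, not taken from the abstract.

```python
import numpy as np

def awr_loss(log_probs, q_data, v, alpha=1.0, max_weight=100.0):
    """Value-weighted behavioral cloning (AWR-style), assumed form.

    log_probs: log pi(a|s) for dataset actions a.
    q_data:    Q(s, a) at dataset actions.
    v:         V(s) at dataset states.
    The value function only enters as a per-sample weight.
    """
    weights = np.minimum(np.exp(alpha * (q_data - v)), max_weight)  # clip is an assumption
    return float(-np.mean(weights * log_probs))

def ddpg_bc_loss(q_policy, log_probs, alpha=1.0):
    """Behavior-constrained policy gradient (DDPG+BC-style), assumed form.

    q_policy:  Q(s, mu(s)) evaluated at the policy's own actions.
    log_probs: log pi(a|s) for dataset actions a (the BC regularizer).
    The value function enters directly, so its action gradient can be used.
    """
    return float(-np.mean(q_policy + alpha * log_probs))

# Toy batch of two transitions (values are made up for illustration).
log_probs = np.array([-1.0, -2.0])
values = np.array([0.0, 0.0])
print(awr_loss(log_probs, q_data=np.array([1.0, 0.0]), v=values))
print(ddpg_bc_loss(q_policy=np.array([0.5, 0.5]), log_probs=log_probs))
```

In an autodiff framework, the practical difference is where gradients flow: the AWR-style loss differentiates only through `log_probs`, while the DDPG+BC-style loss also differentiates through the critic at the policy's action.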

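Offline reinforcement learning has long suffered from performance problems. This study conducts a systematic empirical analysis of how three components (value function learning, policy extraction, and policy generalization) affect offline RL performance. It finds that the choice of policy extraction algorithm significantly affects the performance and scalability of offline RL, and that the performance problems are mainly caused by imperfect policy generalization on test-time states outside the support of the training data. The study proposes two simple test-time policy improvement methods and shows that they improve offline RL performance.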

Is Value Learning Really the Main Bottleneck in Offline RL?

Implicit Q-learning (IQL) is a strong baseline for offline RL that learns the
value function using only dataset actions, through expectile regression.
However, it is unclear how to recover the implicit policy from the
learned implicit Q-function and why IQL can utilize weighted regression for
policy extraction. IDQL reinterprets IQL as an actor-critic method and derives
the weights of the implicit policy; however, these weights hold only for the
optimal value function. In this work, we introduce a different way to solve the
implicit policy-finding problem (IPF) by formulating this problem as an
optimization problem. Based on this optimization problem, we further propose
two practical algorithms, AlignIQL and AlignIQL-hard, which inherit the
advantages of decoupling actor from critic in IQL and provide insights into why
IQL can use weighted regression for policy extraction. Compared with IQL and
IDQL, our method keeps the simplicity of IQL while solving the implicit
policy-finding problem. Experimental results on D4RL datasets show that our
method achieves competitive or superior results compared with other SOTA
offline RL methods. In particular, on complex sparse-reward tasks such as
Antmaze and Adroit, our method outperforms IQL and IDQL by a significant margin.
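
As background for how IQL fits a value function from dataset actions only, the following is a minimal NumPy sketch of the asymmetric squared loss (expectile regression) used in the IQL paper, applied to the difference Q(s, a) - V(s). The helper `fit_expectile` and its grid search are hypothetical and exist only to show how the minimizer moves with `tau`. This is not the AlignIQL algorithm itself.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2, averaged over a batch.

    diff: Q(s, a) - V(s) for dataset (s, a) pairs.
    tau:  expectile level in (0, 1); tau = 0.5 recovers half the squared error.
    """
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def fit_expectile(samples, tau, grid):
    """Hypothetical helper: grid-search the scalar v minimizing the loss."""
    losses = [expectile_loss(samples - v, tau) for v in grid]
    return float(grid[int(np.argmin(losses))])

# Toy Q-values observed for one state under several dataset actions.
q_samples = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
grid = np.linspace(0.0, 4.0, 401)
print(fit_expectile(q_samples, tau=0.5, grid=grid))  # the sample mean
print(fit_expectile(q_samples, tau=0.9, grid=grid))  # pulled toward the largest Q
```

Raising `tau` toward 1 pushes the fitted value toward the best in-sample Q-value, which is how the value function can approximate a maximum over actions without ever querying actions outside the dataset.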

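This study proposes a method for solving the implicit policy-finding problem by formulating it as an optimization problem. Based on this optimization problem, we further propose two practical algorithms, AlignIQL and AlignIQL-hard, which inherit the advantage of decoupling the actor from the critic in IQL and clarify why IQL can use weighted regression for policy extraction. Experimental results show that, compared with IQL and IDQL, our method keeps the simplicity of IQL and solves the implicit policy-finding problem, achieving results on D4RL datasets that are competitive with or superior to other SOTA offline RL methods. In particular, on complex sparse-reward tasks such as Antmaze and Adroit, our method clearly outperforms IQL and IDQL.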

AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization

Consider learning a policy from example expert behavior, without interaction
with the expert or access to a reinforcement signal. One approach is to recover
the expert's cost function with inverse reinforcement learning, then extract a
policy from that cost function with reinforcement learning. This approach is
indirect and can be slow. We propose a new general framework for directly
extracting a policy from data, as if it were obtained by reinforcement learning
following inverse reinforcement learning. We show that a certain instantiation
of our framework draws an analogy between imitation learning and generative
adversarial networks, from which we derive a model-free imitation learning
algorithm that obtains significant performance gains over existing model-free
methods in imitating complex behaviors in large, high-dimensional environments.
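
For readers who want the analogy with generative adversarial networks in symbols, the saddle-point objective usually associated with this line of work (generative adversarial imitation learning) can be written as below. The notation is not defined in the abstract and is assumed here: D is a discriminator mapping state-action pairs to (0, 1), \pi_E is the expert policy, H(\pi) is the causal entropy of the learner's policy, and \lambda \ge 0 is a regularization weight.

```latex
\min_{\pi} \max_{D} \;
  \mathbb{E}_{\pi}\big[\log D(s,a)\big]
  + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big]
  - \lambda H(\pi)
```

The discriminator plays the role of the GAN critic, separating the learner's state-action pairs from the expert's, while the policy plays the role of the generator and is updated with a model-free RL step.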

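This work proposes a new framework that extracts a policy for expert behavior directly from data, draws an analogy between imitation learning and generative adversarial networks, and derives a model-free imitation learning algorithm, showing that it achieves significant performance gains over existing model-free imitation learning methods when imitating complex behaviors in large, high-dimensional environments.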