Implicit Q-learning (IQL) is a strong baseline for offline RL that learns the
value function from dataset actions alone, using expectile regression.
However, it is unclear how to recover the implicit policy from the
learned implicit Q-function and why IQL can utilize weighted regression for
policy extraction. IDQL reinterprets IQL as an actor-critic method and derives
the weights of the implicit policy, but these weights hold only for the optimal
value function. In this work, we introduce a different way to solve the
implicit policy-finding problem (IPF) by formulating this problem as an
optimization problem. Based on this optimization problem, we further propose
two practical algorithms AlignIQL and AlignIQL-hard, which inherit the
advantages of decoupling actor from critic in IQL and provide insights into why
IQL can use weighted regression for policy extraction. Compared with IQL and
IDQL, we find our method keeps the simplicity of IQL and solves the implicit
policy-finding problem. Experimental results on D4RL datasets show that our
method achieves competitive or superior results compared with other SOTA
offline RL methods. In particular, on complex sparse-reward tasks such as
Antmaze and Adroit, our method outperforms IQL and IDQL by a significant margin.
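
As a rough illustration of the two ingredients mentioned above, the sketch below shows an expectile-style asymmetric squared loss, the form in which IQL's value regression is usually written, and exponentiated-advantage weights of the kind used in weighted regression for policy extraction. The function names, the NumPy setting, and the default values of `tau`, `beta`, and `max_weight` are illustrative assumptions, not taken from the paper, and this is not the AlignIQL objective itself.

```python
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 with u = Q - V.

    tau = 0.5 reduces to half the ordinary mean squared error; tau > 0.5
    penalizes V for falling below Q more than for exceeding it.
    """
    diff = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def awr_weights(q_values, v_values, beta=3.0, max_weight=100.0):
    """Weights exp(beta * (Q - V)) for weighted regression, clipped for stability.

    beta and max_weight are hypothetical defaults chosen for illustration.
    """
    adv = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    return np.minimum(np.exp(beta * adv), max_weight)
```

In a weighted-regression policy update, such weights would multiply the log-likelihood of each dataset action, so only actions present in the dataset are ever evaluated.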

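This work proposes a method for solving the implicit policy-finding problem by formulating it as an optimization problem. Based on this optimization problem, we further propose two practical algorithms, AlignIQL and AlignIQL-hard, which inherit the advantage of decoupling the actor from the critic in IQL and clarify why IQL can use weighted regression for policy extraction. Experimental results show that, compared with IQL and IDQL, our method keeps the simplicity of IQL and solves the implicit policy-finding problem, achieving results on D4RL datasets that are competitive with or superior to other SOTA offline RL methods. In particular, on complex sparse-reward tasks such as Antmaze and Adroit, our method significantly outperforms IQL and IDQL.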

AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization

Offline reinforcement learning (RL) defines a sample-efficient learning
paradigm, where a policy is learned from static and previously collected
datasets without additional interaction with the environment. The major
obstacle to offline RL is the estimation error arising from evaluating the
value of out-of-distribution actions. To tackle this problem, most existing
offline RL methods attempt to acquire a policy that is both "close" to the
behaviors contained in the dataset and sufficiently improved over them, which requires a
trade-off between two possibly conflicting targets. In this paper, we propose a
novel approach, which we refer to as adaptive behavior regularization (ABR), to
balance this critical trade-off. By simply utilizing a sample-based
regularization, ABR enables the policy to adaptively adjust its optimization
objective between cloning and improving over the policy used to generate the
dataset. In the evaluation on D4RL datasets, a widely adopted benchmark for
offline reinforcement learning, ABR can achieve improved or competitive
performance compared to existing state-of-the-art algorithms.

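This paper proposes adaptive behavior regularization (ABR), a method that mitigates the behavior sampling bias present in existing machine learning datasets, thereby improving the efficiency and stability of offline reinforcement learning, and achieves performance on D4RL datasets that is better than or comparable to state-of-the-art algorithms.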

Offline Reinforcement Learning with Adaptive Behavior Regularization

We present state advantage weighting for offline reinforcement learning (RL).
In contrast to action advantage $A(s,a)$ that we commonly adopt in QSA
learning, we leverage state advantage $A(s,s^\prime)$ and QSS learning for
offline RL, hence decoupling the action from values. The agent is expected to
reach high-reward states, with the action determined by how the agent can
reach the corresponding state. Experiments on D4RL datasets show that our
proposed method achieves remarkable performance against common
baselines. Furthermore, our method shows good generalization capability when
transferring from offline to online.
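
To make the contrast with action advantage concrete, the sketch below computes a state advantage over transitions $(s, s^\prime)$ and turns it into exponentiated weights. It assumes, by analogy with $A(s,a) = Q(s,a) - V(s)$, that $A(s,s^\prime) = Q(s,s^\prime) - V(s)$; the tabular arrays, the function names, and the defaults for `beta` and `max_weight` are hypothetical and not taken from the paper.

```python
import numpy as np

def state_advantage(q_ss, v, s, s_next):
    """Assumed definition A(s, s') = Q(s, s') - V(s) on tabular values.

    q_ss[i, j] holds the value of moving from state i to state j,
    v[i] holds the value of state i; s and s_next are index arrays.
    """
    q_ss = np.asarray(q_ss, dtype=float)
    v = np.asarray(v, dtype=float)
    s = np.asarray(s, dtype=int)
    s_next = np.asarray(s_next, dtype=int)
    return q_ss[s, s_next] - v[s]

def state_advantage_weights(adv, beta=1.0, max_weight=100.0):
    """Exponentiated, clipped weights over (s, s') pairs; no action appears."""
    return np.minimum(np.exp(beta * np.asarray(adv, dtype=float)), max_weight)
```

Because the weights depend only on state pairs, the action can be chosen separately, for example by whatever mechanism predicts how to reach the preferred next state.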

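This paper proposes an offline reinforcement learning method based on state advantage weighting and QSS learning. Compared with conventional methods based on action advantage, it transfers better from offline to online, and experiments show that it achieves significant performance gains and good generalization on D4RL datasets.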