Implicit Q-learning (IQL) is a strong baseline for offline RL that learns the
value function using only dataset actions through expectile regression.
However, it is unclear how to recover the implicit policy from the
learned implicit Q-function and why IQL can utilize weighted regression for
policy extraction. IDQL reinterprets IQL as an actor-critic method and derives
the weights of the implicit policy; however, these weights hold only for the
optimal value function. In this work, we introduce a different way to solve the
implicit policy-finding problem (IPF) by formulating it as an optimization
problem. Based on this formulation, we further propose two practical
algorithms, AlignIQL and AlignIQL-hard, which inherit the advantages of
decoupling the actor from the critic in IQL and provide insight into why IQL
can use weighted regression for policy extraction. Compared with IQL and IDQL,
our method keeps the simplicity of IQL while solving the implicit
policy-finding problem. Experimental results on D4RL datasets show that our
method achieves competitive or superior results compared with other SOTA
offline RL methods. In particular, on complex sparse-reward tasks such as
Antmaze and Adroit, our method outperforms IQL and IDQL by a significant margin.

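This study proposes a method for solving the implicit policy-finding problem
and formulates it as an optimization problem. Based on this optimization
problem, we further propose two practical algorithms, AlignIQL and
AlignIQL-hard, which inherit the advantage of decoupling the actor from the
critic in IQL and clarify why IQL can use weighted regression for policy
extraction. Experimental results show that, compared with IQL and IDQL, our
method retains the simplicity of IQL and solves the implicit policy-finding
problem, achieving results on D4RL datasets that are competitive with or
superior to other SOTA offline RL methods. In particular, on complex
sparse-reward tasks such as Antmaze and Adroit, our method clearly outperforms
IQL and IDQL.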