Offline reinforcement learning methods hold the promise of learning policies
from pre-collected datasets without the need to query the environment for new
transitions. This setting is particularly well-suited for continuous control
robotic applications for which online data collection based on trial-and-error
is costly and potentially unsafe. In practice, offline datasets are often
heterogeneous, i.e., collected in a variety of scenarios, such as data from
several human demonstrators or from policies that act with different purposes.
Unfortunately, such datasets can exacerbate the distribution shift between the
behavior policy underlying the data and the optimal policy to be learned,
leading to poor performance. To address this challenge, we propose to leverage
latent-variable policies that can represent a broader class of policy
distributions, leading to better adherence to the training data distribution
while maximizing reward via a policy over the latent variable. As we
empirically show on a range of simulated locomotion, navigation, and
manipulation tasks, our method, referred to as latent-variable
advantage-weighted policy optimization (LAPO), improves over the average
performance of the next best-performing offline reinforcement learning methods by 49% on
heterogeneous datasets, and by 8% on datasets with narrow and biased
distributions.
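
As a rough illustration of the two ingredients named above, the following minimal Python sketch shows a decoder that maps a state and a latent variable to an action, a policy that acts in the latent space, and exponentiated-advantage weights over training samples. The abstract does not specify architectures or losses, so every function name, the linear-tanh parameterization, the temperature, and the latent bound are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def advantage_weights(advantages, temperature=1.0):
    """Normalized exp(A / temperature) weights (hypothetical weighting scheme)."""
    adv = np.asarray(advantages, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    w = np.exp((adv - adv.max()) / temperature)
    return w / w.sum()

def decode_action(state, z, W_s, W_z):
    """Toy decoder (assumed form): maps (state, latent) to a bounded action."""
    return np.tanh(W_s @ state + W_z @ z)

def latent_policy(state, W_pi, max_latent=2.0):
    """Toy latent policy (assumed form): picks a bounded latent z for a state."""
    return max_latent * np.tanh(W_pi @ state)

def act(state, W_pi, W_s, W_z):
    """Select a latent with the latent policy, then decode it into an action."""
    return decode_action(state, latent_policy(state, W_pi), W_s, W_z)

# Illustrative dimensions and random parameters; none come from the paper.
rng = np.random.default_rng(0)
state_dim, latent_dim, action_dim = 3, 2, 2
W_pi = rng.normal(size=(latent_dim, state_dim))
W_s = rng.normal(size=(action_dim, state_dim))
W_z = rng.normal(size=(action_dim, latent_dim))

action = act(np.ones(state_dim), W_pi, W_s, W_z)
weights = advantage_weights([0.0, 1.0, 2.0])
```

In this sketch, transitions with a higher estimated advantage would receive a larger weight when fitting the decoder, and bounding the latent is one possible way to keep decoded actions close to those seen in the data.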

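This paper proposes LAPO (latent-variable advantage-weighted policy optimization), a method that uses latent-variable policies to address distribution shift in offline datasets and achieves substantial performance gains over comparable methods across a range of tasks.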

Latent-Variable Advantage-Weighted Policy Optimization for Offline RL

Deep Reinforcement Learning (DRL) algorithms for continuous action spaces are
known to be brittle with respect to hyperparameters as well as sample
inefficient. Soft Actor Critic (SAC) is an off-policy deep actor-critic
algorithm within the maximum entropy RL framework which offers greater
stability and empirical gains. The choice of policy distribution, a factored
Gaussian, is motivated by its easy re-parametrization rather
than its modeling power. We introduce Normalizing Flow policies within the SAC
framework that learn more expressive classes of policies than simple factored
Gaussians. We show empirically on
continuous grid world tasks that our approach increases stability and is better
suited to difficult exploration in sparse reward settings.
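
To make the idea of a flow-based policy concrete, here is a minimal Python sketch in which a reparameterized factored-Gaussian sample is passed through planar flow layers and its log-density is tracked with the change-of-variables formula. The abstract does not say which flow family is used, so the planar flow, the function names, and the parameter values below are assumptions for illustration only.

```python
import numpy as np

def gaussian_log_prob(eps):
    """Log-density of a standard factored Gaussian evaluated at eps."""
    return -0.5 * np.sum(eps**2 + np.log(2.0 * np.pi))

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b); return f(z) and log|det df/dz|."""
    h = np.tanh(w @ z + b)
    psi = (1.0 - h**2) * w          # gradient of tanh(w.z + b) w.r.t. z
    log_det = np.log(np.abs(1.0 + u @ psi))
    return z + u * h, log_det

def sample_flow_policy(mean, log_std, flow_params, rng):
    """Sample an action and its log-probability under the flow policy."""
    eps = rng.standard_normal(mean.shape)
    z = mean + np.exp(log_std) * eps                   # reparameterized base sample
    log_prob = gaussian_log_prob(eps) - np.sum(log_std)
    for u, w, b in flow_params:
        z, log_det = planar_flow(z, u, w, b)
        log_prob -= log_det                            # change of variables
    return z, log_prob

# Illustrative parameters (assumed); w.u >= -1 keeps the layer invertible.
rng = np.random.default_rng(0)
mean, log_std = np.zeros(2), np.zeros(2)
flow_params = [(np.array([0.5, 0.1]), np.array([1.0, -0.3]), 0.2)]
action, log_prob = sample_flow_policy(mean, log_std, flow_params, rng)
```

In a maximum entropy setup such as SAC, a log-probability of this kind is the quantity that would enter the entropy term of the objective.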

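This work proposes a normalizing-flow policy distribution built on the Soft Actor Critic algorithm, increasing the policy's expressiveness to improve stability and exploration in sparse-reward environments.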