Behavioral cloning (BC) provides a straightforward solution to offline RL by
mimicking offline trajectories via supervised learning. Recent advances (Chen
et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by
conditioning on desired future returns, BC can perform competitively with its
value-based counterparts while being considerably simpler and more stable to
train. Although these methods are promising, we show that they can be unreliable, as
their performance may degrade significantly when conditioned on high,
out-of-distribution (ood) returns. This failure mode matters in practice, as we
often expect the policy to outperform the offline dataset by conditioning on
an ood return. We show that this unreliability arises from both the
suboptimality of the training data and the model architectures. We propose
ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for
improving the reliability of conditional BC with two key components: trajectory
weighting and conservative regularization. Trajectory weighting upweights the
high-return trajectories to reduce the train-test gap for BC methods, while the
conservative regularizer encourages the policy to stay close to the data
distribution for ood conditioning. We study CWBC in the context of RvS (Emmons
et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC
significantly boosts their performance on various benchmarks.

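This paper presents an improved behavioral cloning method, ConserWeightive Behavioral Cloning (CWBC), which has two core components: trajectory weighting and conservative regularization. By upweighting high-return trajectories and encouraging the policy to stay close to the data distribution, it improves the reliability of conditional behavioral cloning and performs well on multiple benchmarks.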
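
To make the two components more concrete, the following is a minimal NumPy sketch of how they might look. It is an illustration only, not the formulation used in the paper. The softmax-over-returns weighting, the `temperature`, `threshold`, and `noise_scale` parameters, the squared-error penalty, and the `policy(states, returns)` callable are all assumptions introduced here.

```python
import numpy as np


def trajectory_weights(returns, temperature=1.0):
    """Sampling probabilities that favor high-return trajectories.

    Assumed form: a softmax over range-normalized returns. A smaller
    temperature concentrates more probability on the best trajectories.
    """
    r = np.asarray(returns, dtype=np.float64)
    span = r.max() - r.min()
    if span > 0:
        z = (r - r.max()) / (temperature * span)
    else:
        z = np.zeros_like(r)  # all returns equal: fall back to uniform weights
    w = np.exp(z)
    return w / w.sum()


def conservative_penalty(policy, states, actions, returns,
                         threshold, noise_scale, rng):
    """Penalty that keeps the policy near dataset actions under inflated returns.

    Assumed form: for samples whose return is at least `threshold`, add
    positive noise to the conditioning return (pushing it out of
    distribution) and measure the mean squared error between the policy's
    action and the dataset action.
    """
    r = np.asarray(returns, dtype=np.float64)
    mask = r >= threshold
    if not mask.any():
        return 0.0
    noise = np.abs(rng.normal(scale=noise_scale, size=int(mask.sum())))
    inflated = r[mask] + noise
    predicted = policy(states[mask], inflated)
    return float(np.mean((predicted - actions[mask]) ** 2))
```

In a training loop, one would presumably sample trajectories with `rng.choice(len(returns), p=trajectory_weights(returns))` and add a scaled `conservative_penalty` term to the usual BC loss. How the two terms are balanced is left open here.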