Offline reinforcement learning (RL) aims to derive policies from
static trajectory data without requiring real-time environment interaction.
Recent studies have shown the feasibility of framing offline RL as a sequence
modeling task, where the sole aim is to predict actions based on prior context
using the transformer architecture. However, this single-task learning
approach can undermine the transformer model's
attention mechanism, which should ideally allocate varying attention weights
across different tokens in the input context for optimal prediction. To address
this, we reformulate offline RL as a multi-objective optimization problem,
where the prediction is extended to states and returns. We also highlight a
potential flaw in the trajectory representation used for sequence modeling,
which could generate inaccuracies when modeling the state and return
distributions. This is due to the non-smoothness of the action distribution
within the trajectory dictated by the behavioral policy. To mitigate this
issue, we introduce action space regions to the trajectory representation. Our
experiments on D4RL benchmark locomotion tasks show that our proposals
allow for more effective utilization of the attention mechanism in the
transformer model, resulting in performance that either matches or outperforms
current state-of-the-art methods.
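
As a rough illustration of the two ideas above, the following minimal sketch maps a continuous action to a discrete action space region and combines the action, state, and return prediction errors into one training objective. The abstract does not specify how regions are constructed or how the objectives are combined, so the uniform per-dimension binning, the weighted-sum scalarization, and the names `action_region`, `mse`, and `multi_objective_loss` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Sequence


def action_region(action: Sequence[float], low: Sequence[float],
                  high: Sequence[float], bins_per_dim: int) -> int:
    """Map a continuous action to one integer region index.

    Assumption: regions are uniform bins along each action dimension,
    flattened into a single index. The paper may define them differently.
    """
    index = 0
    for a, lo, hi in zip(action, low, high):
        frac = (a - lo) / (hi - lo)
        # Clamp so that a == hi (or slightly out-of-range values) stay in a valid bin.
        b = max(0, min(int(frac * bins_per_dim), bins_per_dim - 1))
        index = index * bins_per_dim + b
    return index


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_objective_loss(pred: Dict[str, Sequence[float]],
                         target: Dict[str, Sequence[float]],
                         weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective errors (action, state, return).

    Assumption: the objectives are scalarized with fixed weights.
    """
    return sum(w * mse(pred[k], target[k]) for k, w in weights.items())


if __name__ == "__main__":
    # Two-dimensional action in [-1, 1]^2 with 4 bins per dimension.
    region = action_region([0.0, 0.0], [-1.0, -1.0], [1.0, 1.0], 4)
    loss = multi_objective_loss(
        pred={"action": [0.0, 0.0], "state": [1.0], "return": [2.0]},
        target={"action": [1.0, 1.0], "state": [1.0], "return": [0.0]},
        weights={"action": 1.0, "state": 1.0, "return": 1.0},
    )
    print(region, loss)
```

In a sequence model, a region index of this kind would be embedded as an extra token in the trajectory alongside the return, state, and action tokens. Where it is placed in the sequence is a design choice that this sketch leaves open.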

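Offline reinforcement learning, framed as a sequence modeling task, is reformulated as a multi-objective optimization problem, and action space regions are introduced to address the potential problem in how the transformer model's attention mechanism allocates varying attention weights across its input. Experiments show that these proposals let the transformer model use its attention mechanism more effectively, matching or exceeding the performance of current state-of-the-art methods.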