While conditional sequence modeling with the transformer architecture has
proven effective for offline reinforcement learning (RL) tasks, it
struggles to handle out-of-distribution states and actions.
Existing work attempts to address this issue through data augmentation with the
learned policy or by adding extra constraints from value-based RL algorithms.
However, these studies still fail to overcome the following challenges: (1)
insufficient use of the historical temporal information across steps,
(2) neglect of the local intra-step relationships among states, actions, and
returns-to-go (RTGs), and (3) overfitting to suboptimal trajectories with noisy
labels. To address these challenges, we propose Decision Mamba (DM), a novel
multi-grained state space model (SSM) with a self-evolving policy learning
strategy. DM explicitly models the historical hidden state with the Mamba
architecture to extract temporal information. To capture the
relationships within state-action-RTG triplets, a fine-grained SSM module is
designed and integrated into the original coarse-grained SSM in Mamba,
resulting in a novel Mamba architecture tailored for offline RL. Finally, to
mitigate overfitting to noisy trajectories, we propose a self-evolving policy
trained with progressive regularization. The policy evolves by using its
own past knowledge to refine suboptimal actions, thus enhancing its
robustness to noisy demonstrations. Extensive experiments on various tasks show
that DM substantially outperforms other baselines.
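
To make the phrase "explicitly models the historical hidden state" concrete, the following is a minimal sketch of a discrete linear SSM recurrence, h_t = a * h_{t-1} + b * x_t with readout y_t = c . h_t. It is an illustration only and not the DM architecture: the function name `ssm_scan`, the diagonal parameterization, and the fixed (input-independent) parameters are assumptions made for brevity, whereas Mamba uses input-dependent parameters.

```python
def ssm_scan(a, b, c, xs):
    """Run a diagonal linear SSM over a scalar input sequence.

    Hypothetical helper for illustration (not from the paper):
      h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
      y_t    = sum_i c[i] * h_t[i]
    The parameters a, b, c are fixed here; this is a simplifying assumption.
    """
    h = [0.0] * len(a)          # hidden state summarizing all past inputs
    ys = []
    for x in xs:
        # Update each hidden channel from its previous value and the new input.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # Read out a scalar output from the current hidden state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


# A single impulse decays geometrically through the hidden state.
outputs, final_state = ssm_scan(a=[0.5], b=[1.0], c=[1.0], xs=[1.0, 0.0, 0.0])
```

Because the hidden state is carried from step to step, an input at the first step still influences later outputs, which is the sense in which the recurrence retains historical temporal information.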

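Decision Mamba (DM) is a novel multi-grained state space model (SSM) for offline reinforcement learning (RL) tasks that have so far been addressed with conditional sequence modeling and the transformer architecture. DM explicitly models the historical hidden state with the Mamba architecture to extract temporal information, and captures the relationships within state-action-RTG triplets through a fine-grained SSM module, yielding a design tailored for offline RL. In addition, a self-evolving policy based on progressive regularization is proposed to mitigate overfitting caused by noisy trajectories. Extensive experiments on a range of tasks show that DM clearly outperforms the other baselines.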
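
One possible reading of the self-evolving policy is sketched below: the training target for an action is a blend of the (possibly noisy) dataset label and the policy's own earlier prediction, with the blend weight growing as training progresses. This is an assumption for illustration, not the formulation used in DM; the names `beta_schedule` and `refined_target`, the linear schedule, and the cap `beta_max` are all hypothetical.

```python
def beta_schedule(step, total_steps, beta_max=0.5):
    """Hypothetical linear ramp: trust in past predictions grows from 0 to beta_max."""
    return beta_max * min(1.0, step / total_steps)


def refined_target(label, past_pred, beta):
    """Blend the dataset action label with the policy's own past prediction.

    beta = 0 reproduces the dataset label exactly; a larger beta leans more
    on the policy's earlier output (an assumed form of progressive regularization).
    """
    return [(1.0 - beta) * l + beta * p for l, p in zip(label, past_pred)]


# Early in training the label is used as-is; later it is pulled toward
# what the policy previously predicted for the same step.
early = refined_target([1.0, 0.0], [0.0, 1.0], beta_schedule(0, 100))
late = refined_target([1.0, 0.0], [0.0, 1.0], beta_schedule(100, 100))
```

Ramping the weight from zero reflects the intuition that early predictions are unreliable, so the policy should lean on its own past knowledge only once it has learned something from the data.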