While conditional sequence modeling with the transformer architecture has
demonstrated its effectiveness in offline reinforcement learning
(RL) tasks, it struggles to handle out-of-distribution states and actions.
Existing work attempts to address this issue through data augmentation with the
learned policy or by adding extra constraints from value-based RL algorithms.
However, these studies still fail to overcome the following challenges: (1)
insufficient use of historical temporal information across steps,
(2) neglect of the local intra-step relationships among states, actions, and
returns-to-go (RTGs), and (3) overfitting to suboptimal trajectories with noisy
labels. To address these challenges, we propose Decision Mamba (DM), a novel
multi-grained state space model (SSM) with a self-evolving policy learning
strategy. DM explicitly models the historical hidden state to extract
temporal information using the Mamba architecture. To capture the
relationships among state-action-RTG triplets, a fine-grained SSM module is
designed and integrated into the original coarse-grained SSM in Mamba,
resulting in a novel Mamba architecture tailored for offline RL. Finally, to
mitigate overfitting to noisy trajectories, we propose a self-evolving policy
trained with progressive regularization. The policy evolves by using its
own past knowledge to refine suboptimal actions, thus enhancing its
robustness to noisy demonstrations. Extensive experiments on various tasks show
that DM substantially outperforms the baselines.
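
To make the state-action-RTG triplets mentioned above concrete, here is a minimal sketch of how returns-to-go are usually computed from a reward sequence and paired with states and actions. The function names are hypothetical. The (RTG, state, action) ordering follows the common Decision Transformer convention, and the undiscounted default is an assumption; the abstract specifies neither.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go at step t is the (optionally discounted) sum of rewards from t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already accumulated for later steps.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def to_triplets(
    states: Sequence[Any], actions: Sequence[Any], rewards: Sequence[float]
) -> List[Tuple[float, Any, Any]]:
    """Pair each step's RTG with its state and action (assumed ordering: RTG, state, action)."""
    return list(zip(returns_to_go(rewards), states, actions))


if __name__ == "__main__":
    # Toy trajectory with placeholder state and action labels.
    for triplet in to_triplets(["s0", "s1", "s2"], ["a0", "a1", "a2"], [1.0, 2.0, 3.0]):
        print(triplet)
```

A sequence model conditioned on these triplets is asked to produce actions consistent with the target return, which is why noisy or suboptimal returns in the dataset can mislead it.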

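Decision Mamba (DM) is a novel multi-grained state space model (SSM) for offline reinforcement learning (RL), a setting in which conditional sequence modeling with the transformer architecture has been applied. DM explicitly models the historical hidden state with the Mamba architecture to extract temporal information, and captures the relationships among state-action-RTG triplets with a fine-grained SSM module, yielding a design tailored to offline RL. In addition, a self-evolving policy based on progressive regularization is proposed to mitigate overfitting caused by noisy trajectories. Extensive experiments across tasks show that DM clearly outperforms the baseline models.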

Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL

Resource-constrained robotic platforms are particularly useful for tasks that
require low-cost hardware alternatives due to the risk of losing the robot,
as in search-and-rescue applications, or the need for a large number of
devices, as in swarm robotics. For this reason, it is crucial to find
mechanisms for adapting reinforcement learning techniques to the constraints
imposed by the lower computational power and smaller memory capacity of these
ultra-low-cost robotic platforms. We address this need by proposing a
method for making imitation learning deployable onto resource-constrained
robotic platforms. Here we cast the imitation learning problem as a conditional
sequence modeling task and we train a decision transformer using expert
demonstrations augmented with a custom reward. Then, we compress the resulting
generative model using software optimization schemes, including quantization
and pruning. We test our method in simulation using Isaac Gym, a realistic
physics simulation environment designed for reinforcement learning. We
empirically demonstrate that our method achieves natural-looking gaits for
Bittle, a resource-constrained quadruped robot. We also run multiple
simulations to show the effects of pruning and quantization on the performance
of the model. Our results show that quantization (down to 4 bits) and pruning
reduce model size by around 30% while maintaining a competitive reward, making
the model deployable in a resource-constrained system.
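
As an illustration of the two compression steps named above, the following sketch applies magnitude pruning and symmetric uniform quantization to a flat list of weights. The abstract does not say which pruning or quantization schemes were used, so both choices, along with all function names and the toy weights, are assumptions made for illustration only.

```python
from typing import List, Tuple


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_symmetric(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map weights to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.6, -0.1, 0.05, -1.4]  # toy values, not taken from the paper
    pruned = magnitude_prune(weights, sparsity=0.5)
    codes, scale = quantize_symmetric(pruned, bits=4)
    print(pruned, codes, dequantize(codes, scale))
```

Pruning shrinks the number of non-zero parameters that must be stored, while quantization shrinks the number of bits per stored parameter; the actual size reduction depends on the storage format and is not reproduced by this toy example.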

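We propose a method for deploying imitation learning on resource-constrained robotic platforms. We cast imitation learning as a conditional sequence modeling task, train a decision transformer on expert demonstrations augmented with a custom reward, and compress the resulting generative model with software optimization schemes such as quantization and pruning. We validate the method in the Isaac Gym simulation environment, achieving natural gaits for Bittle, a resource-constrained quadruped robot, and run multiple simulations to show how pruning and quantization affect model performance. The results show that quantization (down to 4 bits) and pruning reduce model size by about 30% while maintaining a competitive reward, making the model deployable on resource-constrained systems.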

Tiny Reinforcement Learning for Quadruped Locomotion using Decision Transformers

We present Skill Transformer, an approach for solving long-horizon robotic
tasks by combining conditional sequence modeling and skill modularity.
Conditioned on egocentric and proprioceptive observations of a robot, Skill
Transformer is trained end-to-end to predict both a high-level skill (e.g.,
navigation, picking, placing), and a whole-body low-level action (e.g., base
and arm motion), using a transformer architecture and demonstration
trajectories that solve the full task. It retains the composability and
modularity of the overall task through a skill predictor module while reasoning
about low-level actions and avoiding hand-off errors, common in modular
approaches. We test Skill Transformer on an embodied rearrangement benchmark
and find it performs robust task planning and low-level control in new
scenarios, achieving a 2.5x higher success rate than baselines in hard
rearrangement problems.

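We propose Skill Transformer, an approach that combines conditional sequence modeling with skill modularity to solve long-horizon robotic tasks. Using a transformer architecture and demonstration trajectories, it is trained end-to-end to predict both high-level skills and low-level actions. A skill predictor module preserves the composability and modularity of the overall task, while the model reasons about low-level actions and avoids the hand-off errors common in modular approaches. Tested on challenging rearrangement problems, Skill Transformer performs robust task planning and low-level control in new scenarios and achieves a 2.5x higher success rate than the baselines.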