Solving complex manipulation tasks often requires a policy to learn a diverse
set of skills. These skills are often highly multimodal: each may have its own
distinct distribution of actions and states. Standard deep policy-learning
algorithms typically model policies as deep neural networks with a single
output head (deterministic or stochastic). This structure requires the network
to learn to switch between modes internally, which can lead to lower sample
efficiency and poor performance. In this paper we explore a simple structure
that is conducive to learning the skills that many manipulation tasks require.
Specifically, we
propose a policy architecture that sequentially executes different action heads
for fixed durations, enabling the learning of primitive skills such as reaching
and grasping. Our empirical evaluation on the Metaworld tasks reveals that this
simple structure outperforms standard policy learning methods, highlighting its
potential for improved skill acquisition.
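
The time-indexed switching described above can be illustrated with a minimal
sketch. It assumes linear heads, equal fixed durations, and that the last head
stays active once the schedule runs out; the abstract specifies none of these
details, and the names `head_index` and `TimeIndexedPolicy` are hypothetical.

```python
import random


def head_index(t, head_duration, num_heads):
    """Choose the active head from the timestep alone (assumed schedule)."""
    # Each head runs for head_duration steps; the last head is held afterwards.
    return min(t // head_duration, num_heads - 1)


class TimeIndexedPolicy:
    """Hypothetical linear policy with one weight matrix (head) per segment."""

    def __init__(self, obs_dim, act_dim, num_heads, head_duration, seed=0):
        rng = random.Random(seed)
        self.num_heads = num_heads
        self.head_duration = head_duration
        # heads[k][i][j]: weight from observation j to action i in head k.
        self.heads = [
            [[rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
             for _ in range(act_dim)]
            for _ in range(num_heads)
        ]

    def act(self, obs, t):
        """Return the action of the head that is active at timestep t."""
        k = head_index(t, self.head_duration, self.num_heads)
        return [sum(w * o for w, o in zip(row, obs)) for row in self.heads[k]]


policy = TimeIndexedPolicy(obs_dim=4, act_dim=2, num_heads=3, head_duration=50)
schedule = [head_index(t, 50, 3) for t in (0, 49, 50, 99, 100, 499)]
```

Here `schedule` evaluates to `[0, 0, 1, 1, 2, 2]`: the network never has to
infer which mode it is in, because the timestep selects the head directly.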

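We propose a policy structure that learns primitive skills by sequentially
executing different action heads, which supports the skill learning that
manipulation tasks require. Evaluation on the Metaworld tasks shows that this
simple structure outperforms standard policy learning methods, highlighting its
potential for improved skill acquisition.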

On Time-Indexing as Inductive Bias in Deep RL for Sequential Manipulation Tasks

Deep reinforcement learning (DRL) is a promising approach for developing
legged locomotion skills. However, the iterative design process that is
inevitable in practice is poorly supported by the default methodology. It is
difficult to predict the outcomes of changes made to the reward functions,
policy architectures, and the set of tasks being trained on. In this paper, we
propose a practical method that allows the reward function to be fully
redefined on each successive design iteration while limiting the deviation from
the previous iteration. We characterize policies via sets of Deterministic
Action Stochastic State (DASS) tuples, which pair states visited by the trained
stochastic policy with the deterministic actions the policy takes in those
states. New policies are trained using a policy gradient algorithm that mixes
RL-based policy gradients with gradient updates defined by the DASS tuples. The
tuples also allow for robust policy distillation to new network
architectures. We demonstrate the effectiveness of this iterative-design
approach on the bipedal robot Cassie, achieving stable walking with different
gait styles at various speeds. We also show that policies learned in
simulation transfer successfully to the physical robot without any dynamics
randomization, and that variable-speed walking policies for the physical robot
can be represented by a small dataset of 5-10k tuples.
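
The DASS idea can be sketched on a toy problem. The example below is only an
illustration under strong assumptions: a scalar linear policy, made-up
one-dimensional dynamics, a squared-error imitation loss, and an RL gradient
passed in as a plain number. The function names, the loss, and the mixing
weight `dass_weight` are hypothetical and are not taken from the paper.

```python
import random


def collect_dass_tuples(w_old, num_steps, noise_std=0.1, seed=0):
    """Visit states with the stochastic policy, store deterministic actions."""
    rng = random.Random(seed)
    state, tuples = 1.0, []
    for _ in range(num_steps):
        mean_action = w_old * state                # deterministic action
        tuples.append((state, mean_action))        # one DASS tuple
        noisy_action = mean_action + rng.gauss(0.0, noise_std)
        state = state + 0.1 * noisy_action         # toy dynamics (assumption)
    return tuples


def dass_gradient(w_new, tuples):
    """Gradient of the mean squared error between the new policy's
    deterministic actions and the stored actions."""
    return sum(2.0 * (w_new * s - a) * s for s, a in tuples) / len(tuples)


def mixed_update(w_new, rl_gradient, tuples, lr=0.5, dass_weight=1.0):
    """One descent step that mixes an RL gradient with the DASS gradient."""
    total = rl_gradient + dass_weight * dass_gradient(w_new, tuples)
    return w_new - lr * total


tuples = collect_dass_tuples(w_old=-0.5, num_steps=200)
w = 0.0
for _ in range(500):
    # rl_gradient=0.0 isolates the DASS term, i.e. pure distillation.
    w = mixed_update(w, rl_gradient=0.0, tuples=tuples)
```

With the RL gradient set to zero, the loop reduces to distilling the old policy
into the new parameter, so `w` moves toward `w_old`. A nonzero `rl_gradient`
from a redefined reward would pull `w` away from it, with `dass_weight`
controlling how far the new policy may drift from the previous iteration.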

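This paper proposes a practical method that allows the reward function to be
fully redefined while limiting deviation from the previous iteration's results.
Policies are characterized by a set of DASS tuples, which are combined with the
gradient updates. The method's effectiveness is demonstrated on the bipedal
robot Cassie, achieving stable walking with different gait styles at different
speeds.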