Learning effective continuous control policies in high-dimensional systems,
including musculoskeletal agents, remains a significant challenge. Over the
course of biological evolution, organisms have developed robust mechanisms for
overcoming this complexity to learn highly sophisticated strategies for motor
control. What accounts for this robust behavioral flexibility? Modular control
via muscle synergies, i.e., coordinated muscle co-contractions, is considered to
be one putative mechanism that enables organisms to learn muscle control in a
simplified and generalizable action space. Drawing inspiration from this
evolved motor control strategy, we use physiologically accurate human hand and
leg models as a testbed for determining the extent to which a Synergistic
Action Representation (SAR) acquired from simpler tasks facilitates learning
more complex tasks. We find in both cases that SAR-exploiting policies
significantly outperform end-to-end reinforcement learning. Policies trained
with SAR were able to achieve robust locomotion on a wide set of terrains with
high sample efficiency, while baseline approaches failed to learn meaningful
behaviors. Additionally, policies trained with SAR on a multiobject
manipulation task significantly outperformed (>70% success) baseline approaches
(<20% success). Both of these SAR-exploiting policies were also found to
generalize zero-shot to out-of-domain environmental conditions, while policies
that did not adopt SAR failed to generalize. Finally, we establish the
generality of SAR on broader high-dimensional control problems using a robotic
manipulation task set and a full-body humanoid locomotion task. To the best of
our knowledge, this investigation is the first of its kind to present an
end-to-end pipeline for discovering synergies and using this representation to
learn high-dimensional continuous control across a wide diversity of tasks.
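The abstract describes acquiring a low-dimensional synergy basis from muscle activations recorded on simpler tasks, then acting in that space on harder tasks. It does not say how the synergies are extracted. The sketch below assumes plain PCA over recorded activations (done with power iteration plus deflation so it needs no third-party libraries) and a linear map from a synergy-space action back to muscle activations clipped to [0, 1]. All function names and the synthetic data are hypothetical, and the authors' pipeline may include further steps.

```python
import math
import random


def mean_center(data):
    """Subtract the per-muscle mean from each recorded activation vector."""
    n, m = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in data]
    return centered, mean


def covariance(centered):
    """Sample covariance matrix (muscles x muscles) of mean-centered data."""
    n, m = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(m)]
            for i in range(m)]


def top_components(cov, k, iters=500, seed=0):
    """Leading k eigenvectors of a symmetric matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    m = len(cov)
    a = [row[:] for row in cov]
    comps = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue; remove that direction before the next pass.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(m)) for i in range(m))
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
        comps.append(v)
    return comps


def muscle_to_synergy(act, comps, mean):
    """Project a muscle activation vector onto the synergy basis."""
    return [sum(c[j] * (act[j] - mean[j]) for j in range(len(mean))) for c in comps]


def synergy_to_muscle(z, comps, mean):
    """Map a low-dimensional synergy action back to muscle activations in [0, 1]."""
    act = [mean[j] + sum(zc * c[j] for zc, c in zip(z, comps)) for j in range(len(mean))]
    return [min(1.0, max(0.0, x)) for x in act]


def synthetic_activations(n=200, seed=1):
    """Hypothetical 'simple task' data: 6 muscles driven by 2 hidden synergies."""
    rng = random.Random(seed)
    w1 = [0.8, 0.6, 0.4, 0.0, 0.0, 0.1]
    w2 = [0.0, 0.1, 0.3, 0.7, 0.5, 0.6]
    data = []
    for _ in range(n):
        c1, c2 = rng.random(), rng.random()
        data.append([0.5 * (c1 * p + c2 * q) + rng.gauss(0.0, 0.01) for p, q in zip(w1, w2)])
    return data


if __name__ == "__main__":
    centered, mean = mean_center(synthetic_activations())
    comps = top_components(covariance(centered), k=2)
    # A policy acting "in synergy space" emits 2 numbers; the map yields 6 activations.
    print(synergy_to_muscle([0.1, -0.05], comps, mean))
```

The design point is that the policy's output dimension drops from the number of muscles to the number of retained synergies, while the fixed linear map keeps the coordinated co-activation patterns seen on the simpler task.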

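By flexibly using Synergistic Action Representation (SAR) as a suitable control mechanism, high-dimensional continuous control tasks can be learned effectively, with improved sample efficiency and zero-shot generalization across a broad range of task domains.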


SAR: Generalization of Physiological Agility and Dexterity via Synergistic Action Representation

We present a unified framework for learning continuous control policies using
backpropagation. It supports stochastic control by treating stochasticity in
the Bellman equation as a deterministic function of exogenous noise. The
product is a spectrum of general policy gradient algorithms that range from
model-free methods with value functions to model-based methods without value
functions. We use learned models but only require observations from the
environment instead of observations from model-predicted trajectories,
minimizing the impact of compounded model errors. We apply these algorithms
first to a toy stochastic control problem and then to several physics-based
control problems in simulation. One of these variants, SVG(1), shows the
effectiveness of learning models, value functions, and policies simultaneously
in continuous domains.
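The two ideas in this abstract can be illustrated on a 1-D toy problem. First, the policy and dynamics are written as deterministic functions of exogenous noise (a = theta * s + SIGMA * eta, s' = f_hat(s, a) + xi), so a gradient flows through them by the chain rule. Second, the noise is inferred from real transitions instead of rolling out the model. The sketch below is a hedged, SVG(1)-style one-step estimate under assumptions not stated in the abstract: a linear Gaussian policy, a known (assumed learned) linear model `model`, additive model noise, a fixed quadratic value function, and a hypothetical environment `true_env` whose dynamics the model does not match exactly. It is not the authors' implementation.

```python
import random

# Hypothetical 1-D toy problem; every constant here is an illustrative assumption.
GAMMA = 0.9   # discount factor
SIGMA = 0.3   # scale of the Gaussian policy noise eta
C_V = 0.5     # assumed (pre-learned) value function V(s) = -C_V * s**2


def model(s, a):
    """Assumed learned deterministic dynamics f_hat(s, a); noise xi is additive."""
    return s + a


def reward(s, a):
    return -(s * s + a * a)


def infer_noise(theta, s, a, s_next):
    """Recover the exogenous noise that explains an observed real transition."""
    eta = (a - theta * s) / SIGMA      # policy: a = theta * s + SIGMA * eta
    xi = s_next - model(s, a)          # dynamics: s' = f_hat(s, a) + xi
    return eta, xi


def one_step_objective(theta, s, eta, xi):
    """r(s, a) + GAMMA * V(s'): deterministic in theta once eta and xi are fixed."""
    a = theta * s + SIGMA * eta
    s_next = model(s, a) + xi
    return reward(s, a) + GAMMA * (-C_V * s_next ** 2)


def svg1_gradient(theta, s, eta, xi):
    """Pathwise d(one_step_objective)/d(theta) by the chain rule through f_hat."""
    a = theta * s + SIGMA * eta
    s_next = model(s, a) + xi
    dr_da = -2.0 * a
    dv_ds_next = -2.0 * C_V * s_next
    ds_next_da = 1.0                   # Jacobian of the linear model w.r.t. a
    da_dtheta = s
    return (dr_da + GAMMA * dv_ds_next * ds_next_da) * da_dtheta


def true_env(s, a, rng):
    """Hypothetical real environment; its dynamics differ from model()."""
    return 1.1 * s + a + rng.gauss(0.0, 0.1)


def estimate_gradient(theta, n=1000, seed=0):
    """Average SVG(1)-style gradients over real transitions (no model rollouts)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = theta * s + SIGMA * rng.gauss(0.0, 1.0)
        s_next = true_env(s, a, rng)
        eta, xi = infer_noise(theta, s, a, s_next)
        total += svg1_gradient(theta, s, eta, xi)
    return total / n


if __name__ == "__main__":
    theta = 0.0
    for step in range(200):
        theta += 0.5 * estimate_gradient(theta, n=200, seed=step)
    print(round(theta, 2))
```

In this toy setup the expected gradient works out to E[s^2](-2.9 * theta - 0.99), so the ascent loop should end close to theta = -0.99 / 2.9, roughly -0.34. The inferred xi keeps V evaluated at the real next state even though model() ignores the extra 0.1 * s drift of true_env.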
