While imitation learning provides us with an efficient toolkit to train
robots, learning skills that are robust to environment variations remains a
significant challenge. Current approaches address this challenge by relying
either on large amounts of demonstrations that span environment variations or
on handcrafted reward functions that require state estimates. Neither direction
scales to fast imitation. In this work, we present Fast Imitation of
Skills from Humans (FISH), a new imitation learning approach that can learn
robust visual skills with less than a minute of human demonstrations. Given a
weak base-policy trained by offline imitation of demonstrations, FISH computes
rewards that correspond to the "match" between the robot's behavior and the
demonstrations. These rewards are then used to adaptively update a residual
policy that adds on to the base-policy. Across all tasks, FISH requires at most
twenty minutes of interactive learning to imitate demonstrations on object
configurations that were not seen in the demonstrations. Importantly, FISH is
constructed to be versatile, which allows it to be used across robot
morphologies (e.g. xArm, Allegro, Stretch) and camera configurations (e.g.
third-person, eye-in-hand). Our experimental evaluations on 9 different tasks
show that FISH achieves an average success rate of 93%, which is around 3.8x
higher than prior state-of-the-art methods.
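
As a rough illustration of the two ingredients described above, the sketch below computes a per-step "match" reward and combines a base policy with a residual policy. It rests on several assumptions that are not in the text: observations are plain feature vectors, the match is a nearest-neighbour Euclidean distance to the demonstration frames (a stand-in, since the actual matching function is not specified here), and the names `match_reward` and `combined_action` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_reward(rollout: List[Vector], demo: List[Vector]) -> List[float]:
    """Per-step reward: negative distance to the closest demonstration frame.

    Assumption: nearest-neighbour matching is only a stand-in for the
    matching function, which the text does not specify.
    """
    return [-min(euclidean(obs, d) for d in demo) for obs in rollout]


def combined_action(
    base_policy: Callable[[Vector], List[float]],
    residual_policy: Callable[[Vector, List[float]], List[float]],
    obs: Vector,
) -> List[float]:
    """Action sent to the robot: base action plus a learned correction."""
    base = base_policy(obs)
    residual = residual_policy(obs, base)
    return [b + r for b, r in zip(base, residual)]
```

In this sketch the base policy stays fixed after offline imitation, and only the residual policy would be updated from the rewards, so a poor residual early in training perturbs the base behavior rather than replacing it.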

FISH is a versatile imitation learning approach that learns robust visual skills from less than a minute of human demonstrations. It computes match-based rewards and adaptively updates a residual policy, reaching an average success rate of 93% on robotic tasks.

Teaching a Robot to Fish: Learning Versatile Imitation from One Minute of Demonstrations

Teach a Robot to FISH: Versatile Imitation from One Minute of Demonstrations

Generalization in deep reinforcement learning over unseen environment
variations usually requires policy learning over a large set of diverse
training variations. We empirically observe that an agent trained on many
variations (a generalist) tends to learn faster at the beginning, yet its
performance plateaus at a suboptimal level for a long time. In contrast, an
agent trained only on a few variations (a specialist) can often achieve high
returns under a limited computational budget. To have the best of both worlds,
we propose a novel generalist-specialist training framework. Specifically, we
first train a generalist on all environment variations; when it fails to
improve, we launch a large population of specialists with weights cloned from
the generalist, each trained to master a selected small subset of variations.
We finally resume the training of the generalist with auxiliary rewards induced
by demonstrations of all specialists. In particular, we investigate the timing
to start specialist training and compare strategies to learn generalists with
assistance from specialists. We show that this framework pushes the envelope of
policy learning on several challenging and popular benchmarks including
Procgen, Meta-World and ManiSkill.
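
The training schedule described above can be sketched as three small helpers: detect that the generalist has stopped improving, partition the variations, and clone the generalist's weights into one specialist per subset. This is a minimal sketch under assumptions that are not in the text: the plateau test (a comparison of two moving averages), the round-robin partition, and all function names are hypothetical, and weights are represented as a plain dictionary.

```python
import copy
from typing import Dict, List


def has_plateaued(returns: List[float], window: int = 3, tol: float = 0.01) -> bool:
    """True when the mean of the last `window` returns exceeds the mean of
    the preceding `window` returns by less than `tol` (hypothetical criterion)."""
    if len(returns) < 2 * window:
        return False
    recent = sum(returns[-window:]) / window
    previous = sum(returns[-2 * window:-window]) / window
    return recent - previous < tol


def split_variations(variation_ids: List[int], n_specialists: int) -> List[List[int]]:
    """Round-robin partition of environment variations among specialists."""
    return [variation_ids[i::n_specialists] for i in range(n_specialists)]


def launch_specialists(
    generalist_weights: Dict[str, List[float]],
    variation_ids: List[int],
    n_specialists: int,
) -> List[dict]:
    """Give each specialist its own deep copy of the generalist's weights
    and a small subset of variations to master."""
    return [
        {"weights": copy.deepcopy(generalist_weights), "variations": subset}
        for subset in split_variations(variation_ids, n_specialists)
    ]
```

The final step, resuming generalist training with auxiliary rewards induced by specialist demonstrations, is omitted here because the text does not say how those rewards are computed.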

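This paper proposes a new generalist-specialist training framework for reinforcement learning that uses auxiliary rewards and weight cloning to divide training into "generalist training" and "specialist training", with the aim of achieving the best policy learning across different environments.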

Improving Policy Optimization with Generalist-Specialist Learning

Improving Policy Optimization with Generalist-Specialist Learning

Robots will experience non-stationary environment dynamics throughout their
lifetime: the robot dynamics can change due to wear and tear, or its
surroundings may change over time. Eventually, a robot should perform well
in all of the environment variations it has encountered. At the same time, it
should still be able to learn quickly in a new environment. We identify two
challenges in Reinforcement Learning (RL) under such a lifelong learning
setting with off-policy data. First, existing off-policy algorithms struggle
with the trade-off between being conservative to maintain good performance in
the old environment and learning efficiently in the new environment, despite
keeping all the data in the replay buffer. We propose the Offline Distillation
Pipeline to break this trade-off by separating the training procedure into an
online interaction phase and an offline distillation phase. Second, we find that
training with the imbalanced off-policy data from multiple environments across
the lifetime creates a significant performance drop. We find that this
performance drop is caused by the combination of imbalanced quality and
size among the datasets, which exacerbates the extrapolation error of the
Q-function. During the distillation phase, we apply a simple fix to the issue
by keeping the policy closer to the behavior policy that generated the data. In
the experiments, we demonstrate these two challenges and the proposed solutions
with a simulated bipedal robot walking task across various environment
changes. We show that the Offline Distillation Pipeline achieves better
performance across all the encountered environments without affecting data
collection. We also provide a comprehensive empirical study to support our
hypothesis on the data imbalance issue.
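
One common way to keep a policy close to the behavior policy during offline training is to add a behavior-cloning penalty to the usual Q-maximizing objective. The sketch below shows that idea on a single batch. It is an assumption, not the method stated in the text: the abstract does not specify the regularizer, and the function name `regularized_policy_loss` and the `bc_weight` parameter are hypothetical.

```python
from typing import List, Sequence


def regularized_policy_loss(
    q_values: Sequence[float],
    policy_actions: Sequence[Sequence[float]],
    data_actions: Sequence[Sequence[float]],
    bc_weight: float = 1.0,
) -> float:
    """Batch loss = -mean Q(s, pi(s)) + bc_weight * mean ||pi(s) - a_data||^2.

    q_values[i] is the critic's estimate for the policy's action in state i;
    data_actions[i] is the action actually stored in the replay buffer.
    """
    n = len(q_values)
    q_term = -sum(q_values) / n
    bc_term = sum(
        sum((p - d) ** 2 for p, d in zip(pa, da))
        for pa, da in zip(policy_actions, data_actions)
    ) / n
    return q_term + bc_weight * bc_term
```

A larger `bc_weight` pulls the policy toward actions that appear in the data, where the Q-function is less likely to be queried far from what it was trained on; setting it to zero recovers the unregularized objective.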

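This paper addresses how a robot should adapt quickly to a changing environment over its lifetime. Within reinforcement learning, it proposes the Offline Distillation Pipeline, which resolves the dilemma conventional algorithms face between performance in old and new environments as well as the imbalance of training data across multiple environments, and it validates the approach with experiments on a simulated bipedal robot walking task.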