Learning from human demonstration enables robots to automate a variety of
tasks. However, learning directly from human demonstration
is challenging since the structure of the human hand can be very different from
the desired robot gripper. In this work, we show that manipulation skills can
be transferred from a human to a robot through the use of micro-evolutionary
reinforcement learning, where a five-finger dexterous human-hand robot
gradually evolves into a commercial robot while repeatedly interacting in a
physics simulator to continuously update the policy that is first learned from
human demonstration. To deal with the high dimensions of robot parameters, we
propose an algorithm for multi-dimensional evolution path searching that allows
joint optimization of both the robot evolution path and the policy. Through
experiments on human object manipulation datasets, we show that our framework
can efficiently transfer the expert human agent policy trained from human
demonstrations in diverse modalities to target commercial robots.
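
The idea of a robot that gradually evolves from a human hand into a target gripper can be illustrated with a minimal sketch. It assumes, purely for illustration, that each robot is described by a small set of named scalar parameters and that an intermediate robot is a per-parameter linear blend between source and target. The parameter names, the values, and the helper `interpolate_robot` are hypothetical; the abstract does not specify the parameterization or the path search itself.

```python
from typing import Dict


def interpolate_robot(
    source: Dict[str, float],
    target: Dict[str, float],
    alpha: Dict[str, float],
) -> Dict[str, float]:
    """Blend each parameter between source (alpha=0) and target (alpha=1).

    Each parameter has its own alpha, so a path through this space is
    multi-dimensional: different parameters can evolve at different rates.
    """
    blended = {}
    for name, src_value in source.items():
        a = min(1.0, max(0.0, alpha[name]))  # clamp to [0, 1]
        blended[name] = (1.0 - a) * src_value + a * target[name]
    return blended


# Hypothetical parameters, not taken from the paper.
human_hand = {"finger_length": 0.09, "palm_width": 0.08}
gripper = {"finger_length": 0.05, "palm_width": 0.04}

# One intermediate robot: finger length half evolved, palm width untouched.
intermediate = interpolate_robot(
    human_hand, gripper, {"finger_length": 0.5, "palm_width": 0.0}
)
```

In a full system the policy would be fine-tuned in simulation on each intermediate robot before the blend coefficients move further toward the target; that training loop is omitted here.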

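This paper shows how human manipulation skills can be transferred to commercial robots through micro-evolutionary reinforcement learning. It proposes a multi-dimensional evolution path searching algorithm together with the transfer of an expert human agent policy, and validates the framework's effectiveness through experiments.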

HERD: Continuous Human-to-Robot Evolution for Learning from Human Demonstration

Reward and representation learning are two long-standing challenges for
learning an expanding set of robot manipulation skills from sensory
observations. Given the inherent cost and scarcity of in-domain, task-specific
robot data, learning from large, diverse, offline human videos has emerged as a
promising path towards acquiring a generally useful visual representation for
control; however, how these human videos can be used for general-purpose reward
learning remains an open question. We introduce
$\textbf{V}$alue-$\textbf{I}$mplicit $\textbf{P}$re-training (VIP), a
self-supervised pre-trained visual representation capable of generating dense
and smooth reward functions for unseen robotic tasks. VIP casts representation
learning from human videos as an offline goal-conditioned reinforcement
learning problem and derives a self-supervised dual goal-conditioned
value-function objective that does not depend on actions, enabling pre-training
on unlabeled human videos. Theoretically, VIP can be understood as a novel
implicit time contrastive objective that generates a temporally smooth
embedding, enabling the value function to be implicitly defined via the
embedding distance, which can then be used to construct the reward for any
goal-image specified downstream task. Trained on large-scale Ego4D human videos
and without any fine-tuning on in-domain, task-specific data, VIP's frozen
representation can provide dense visual reward for an extensive set of
simulated and $\textbf{real-robot}$ tasks, enabling diverse reward-based visual
control methods and significantly outperforming all prior pre-trained
representations. Notably, VIP can enable simple, $\textbf{few-shot}$ offline RL
on a suite of real-world robot tasks with as few as 20 trajectories.
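
A reward built from embedding distance to a goal image can be sketched as follows. This is an illustration under stated assumptions, not the released VIP implementation: the embeddings are plain lists of floats standing in for the output of a frozen encoder, the value is taken to be the negative Euclidean distance to the goal embedding, and the reward is taken to be the change in that value across one transition. The function names `embedding_value` and `goal_reward` are hypothetical.

```python
import math
from typing import Sequence


def embedding_value(obs_emb: Sequence[float], goal_emb: Sequence[float]) -> float:
    """Assumed value: negative Euclidean distance to the goal embedding."""
    return -math.dist(obs_emb, goal_emb)


def goal_reward(
    obs_emb: Sequence[float],
    next_obs_emb: Sequence[float],
    goal_emb: Sequence[float],
) -> float:
    """Assumed dense reward: how much closer the transition moved to the goal."""
    return embedding_value(next_obs_emb, goal_emb) - embedding_value(obs_emb, goal_emb)


# Toy 2-D embeddings; a real encoder would produce much larger vectors.
goal = [0.0, 0.0]
reward_toward = goal_reward([3.0, 4.0], [0.0, 0.0], goal)  # moves onto the goal
reward_away = goal_reward([0.0, 0.0], [3.0, 4.0], goal)    # moves away from it
```

Because the reward depends only on embeddings of observations and of the goal image, it needs no action labels, which matches the abstract's point about pre-training on unlabeled video.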

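This work proposes VIP, a self-supervised representation learning method that uses self-supervised goal-conditioned reinforcement learning to generate dense, smooth reward functions from unlabeled human videos. It sidesteps the difficulty of collecting robot data and performs strongly in experiments.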

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Imitation Learning is a promising paradigm for learning complex robot
manipulation skills by reproducing behavior from human demonstrations. However,
manipulation tasks often contain bottleneck regions that require a sequence of
precise actions to make meaningful progress, such as a robot inserting a pod
into a coffee machine to make coffee. Trained policies can fail in these
regions because small deviations in actions can lead the policy into states not
covered by the demonstrations. Intervention-based policy learning is an
alternative that can address this issue -- it allows human operators to monitor
trained policies and take over control when they encounter failures. In this
paper, we build a data collection system tailored to 6-DoF manipulation
settings that enables remote human operators to monitor and intervene on
trained policies. We develop a simple and effective algorithm to train the
policy iteratively on new data collected by the system that encourages the
policy to learn how to traverse bottlenecks through the interventions. We
demonstrate that agents trained on data collected by our intervention-based
system and algorithm outperform agents trained on an equivalent number of
samples collected by non-interventional demonstrators, and further show that
our method outperforms multiple state-of-the-art baselines for learning from
the human interventions on a challenging robot threading task and a coffee
making task. Additional results and videos at
this https URL.
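
One way a training step could emphasize human interventions is to reweight samples so that intervention data carries a fixed share of the total loss weight. The sketch below is only an assumption about how such emphasis might be implemented; the abstract does not state the mechanism, and the helper `intervention_weights` and the default share of 0.5 are hypothetical.

```python
from typing import List


def intervention_weights(
    is_intervention: List[bool], intervention_share: float = 0.5
) -> List[float]:
    """Per-sample weights that sum to 1.

    Samples recorded while a human was intervening jointly receive
    `intervention_share` of the total weight; the rest share the remainder.
    Falls back to uniform weights if either group is empty.
    """
    total = len(is_intervention)
    if total == 0:
        return []
    n_int = sum(is_intervention)
    n_rest = total - n_int
    if n_int == 0 or n_rest == 0:
        return [1.0 / total] * total
    w_int = intervention_share / n_int
    w_rest = (1.0 - intervention_share) / n_rest
    return [w_int if flag else w_rest for flag in is_intervention]


# One intervention sample among four: it alone gets half of the weight.
weights = intervention_weights([True, False, False, False])
```

These weights could multiply a per-sample imitation loss in each training round, so that rare intervention samples near bottlenecks are not drowned out by the much larger set of ordinary samples.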

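This paper uses intervention-based policy learning to address the parts of robot manipulation tasks that require a precise sequence of actions. It proposes a data collection system for 6-DoF robot manipulation tasks and develops a simple and effective algorithm that trains the policy on newly collected data so it learns to traverse these bottlenecks. Agents trained with intervention-based policy learning outperform multiple baseline algorithms on a robot threading task and a coffee making task.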