Robot learning tasks are extremely compute-intensive and hardware-specific.
Tackling these challenges with a diverse dataset of offline demonstrations
that can be used to train robot manipulation agents is therefore very
appealing. The Train-Offline-Test-Online (TOTO) Benchmark provides a
well-curated open-source dataset for offline training, composed mostly of
expert data, along with benchmark scores for common offline-RL and behaviour
cloning agents. In this paper, we introduce DiffClone, an offline algorithm
that enhances a behaviour cloning agent with diffusion-based policy learning,
and we measure the efficacy of our method on real physical robots online at
test time. This is also our official submission to the Train-Offline-Test-Online
(TOTO) Benchmark Challenge organized at NeurIPS 2023. We experimented with both
pre-trained visual representations and agent policies. In our experiments, we
find that a MOCO-finetuned ResNet50 performs best among the finetuned
representations we compared. Goal-state conditioning and mapping to transitions
resulted in a marginal increase in the success rate and mean reward. As for the
agent policy, we developed DiffClone, a behaviour cloning agent improved using
conditional diffusion.
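
As a rough illustration of what diffusion-based behaviour cloning involves, the sketch below noises an expert action with the standard DDPM forward process and scores a noise prediction against the injected noise. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear schedule, its default values, and all function names are hypothetical, and the observation-conditioned network that would produce the noise prediction is omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the defaults are illustrative only."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the running product of (1 - beta[s]) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_action(action, t, alpha_bars, rng):
    """Forward process: a_t = sqrt(ab_t) * a_0 + sqrt(1 - ab_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in action]
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noisy = [scale * a + sigma * e for a, e in zip(action, eps)]
    return noisy, eps


def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and injected noise."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_eps, true_eps)]
    return sum(squared) / len(true_eps)
```

In a full agent, a network conditioned on the observation (for example, features from the visual encoder) would receive the noisy action and the step index `t`, and would be trained to minimise `denoising_loss`; at test time the action would be recovered by iteratively denoising from random noise.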

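This paper introduces DiffClone, an offline algorithm for an enhanced behaviour cloning agent trained on a dataset of offline demonstrations, and tests the method's effectiveness on real physical robots online.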

DiffClone: Enhanced Behaviour Cloning in Robotics with Diffusion-Driven Policy Learning

Deep reinforcement learning (DRL) faces significant challenges in addressing
the hard-exploration problems in tasks with sparse or deceptive rewards and
large state spaces. These challenges severely limit the practical application
of DRL. Most previous exploration methods relied on complex architectures to
estimate state novelty or introduced sensitive hyperparameters, resulting in
instability. To mitigate these issues, we propose an efficient adaptive
trajectory-constrained exploration strategy for DRL. The proposed method guides
the policy of the agent away from suboptimal solutions by leveraging incomplete
offline demonstrations as references. This approach gradually expands the
exploration scope of the agent and strives for optimality in a constrained
optimization manner. Additionally, we introduce a novel policy-gradient-based
optimization algorithm that utilizes adaptively clipped trajectory-distance
rewards for both single- and multi-agent reinforcement learning. We provide a
theoretical analysis of our method, including a derivation of the worst-case
approximation error bounds, highlighting the validity of our approach for
enhancing exploration. To evaluate the effectiveness of the proposed method, we
conducted experiments on two large 2D grid world mazes and several MuJoCo
tasks. The extensive experimental results demonstrate the significant
advantages of our method in achieving temporally extended exploration and
avoiding myopic and suboptimal behaviors in both single- and multi-agent
settings. Notably, the specific metrics and quantifiable results further
support these findings. The code used in the study is available at
https://github.com/buaawgj/TACE.
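
To make the idea of a clipped trajectory-distance reward concrete, the sketch below computes a bonus that grows as the agent's trajectory moves away from a reference demonstration and is capped at a threshold. This is one plausible reading only: the abstract does not specify the distance measure or the clipping rule, so the nearest-state Euclidean distance, the fixed `clip` value, and the function names are all assumptions and should not be taken as the method in the linked repository.

```python
import math


def trajectory_distance(trajectory, reference):
    """Mean distance from each visited state to its nearest reference state.

    Assumed distance measure; both arguments are sequences of equal-length
    coordinate tuples.
    """
    total = 0.0
    for state in trajectory:
        total += min(math.dist(state, ref) for ref in reference)
    return total / len(trajectory)


def clipped_distance_bonus(trajectory, reference, clip):
    """Bonus grows with distance from the reference but is capped at `clip`."""
    return min(trajectory_distance(trajectory, reference), clip)


# Example: a suboptimal demonstration along the x-axis and an agent rollout
# that has moved away from it.
reference = [(0.0, 0.0), (1.0, 0.0)]
rollout = [(0.0, 3.0), (1.0, 4.0)]
bonus = clipped_distance_bonus(rollout, reference, clip=2.0)
```

In an adaptive variant, `clip` would be adjusted during training rather than held fixed, which is how the exploration scope could be widened gradually; that schedule is not shown here.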

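We propose an efficient adaptive trajectory-constrained exploration strategy for deep reinforcement learning that uses incomplete offline demonstrations as references, and we introduce a novel policy-gradient-based optimization algorithm that provides adaptively clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. Experiments on two large 2D grid world mazes and several MuJoCo tasks demonstrate the method's significant advantages in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors.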

Adaptive trajectory-constrained exploration strategy for deep reinforcement learning

Visual imitation learning provides a framework for learning complex
manipulation behaviors by leveraging human demonstrations. However, current
interfaces for imitation such as kinesthetic teaching or teleoperation
prohibitively restrict our ability to efficiently collect large-scale data in
the wild. Obtaining such diverse demonstration data is paramount for the
generalization of learned skills to novel scenarios. In this work, we present
an alternate interface for imitation that simplifies the data collection
process while allowing for easy transfer to robots. We use commercially
available reacher-grabber assistive tools both as a data collection device and
as the robot's end-effector. To extract action information from these visual
demonstrations, we use off-the-shelf Structure from Motion (SfM) techniques in
addition to training a finger detection network. We experimentally evaluate on
two challenging tasks: non-prehensile pushing and prehensile stacking, with
1000 diverse demonstrations for each task. For both tasks, we use standard
behavior cloning to learn executable policies from the previously collected
offline demonstrations. To improve learning performance, we employ a variety of
data augmentations and provide an extensive analysis of their effects. Finally,
we demonstrate the utility of our interface by evaluating on real robotic
scenarios with previously unseen objects and achieve an 87% success rate on
pushing and a 62% success rate on stacking. Robot videos are available at
this https URL
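
As an illustration of how data augmentation can enlarge a set of offline demonstrations before behavior cloning, the sketch below applies random cropping to (image, action) pairs. The abstract does not list which augmentations were used, so random cropping is an assumed example, and the function names and the nested-list image representation are hypothetical.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a crop_h x crop_w window taken at a random offset."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment_demonstrations(demos, crop_h, crop_w, copies, seed=0):
    """Expand (image, action) pairs; the action label is left unchanged."""
    rng = random.Random(seed)
    augmented = []
    for image, action in demos:
        for _ in range(copies):
            cropped = random_crop(image, crop_h, crop_w, rng)
            augmented.append((cropped, action))
    return augmented


# Example: one 4x4 single-channel image paired with a 2D action.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
demos = [(image, (0.1, -0.3))]
augmented = augment_demonstrations(demos, crop_h=3, crop_w=3, copies=5)
```

A behavior cloning policy would then be fit by supervised regression from the augmented images to their action labels.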

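This study proposes a practical imitation interface based on commercially available reacher-grabber tools that simplifies data collection and can efficiently gather demonstration data across diverse, complex scenes. It applies several data augmentation techniques to improve learning performance and ultimately achieves high success rates on tasks such as non-prehensile pushing and object stacking.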