We seek to learn a generalizable goal-conditioned policy that enables
zero-shot robot manipulation: interacting with unseen objects in novel scenes
without test-time adaptation. While typical approaches rely on a large amount
of demonstration data for such generalization, we propose an approach that
leverages web videos to predict plausible interaction plans and learns a
task-agnostic transformation to obtain robot actions in the real world. Our
framework, Track2Act, predicts tracks of how points in an image should move in
future time-steps based on a goal, and can be trained with diverse videos on
the web including those of humans and robots manipulating everyday objects. We
use these 2D track predictions to infer a sequence of rigid transforms of the
object to be manipulated, and obtain robot end-effector poses that can be
executed in an open-loop manner. We then refine this open-loop plan by
predicting residual actions through a closed-loop policy trained with a few
embodiment-specific demonstrations. We show that this approach of combining
scalably learned track prediction with a residual policy requiring minimal
in-domain robot-specific data enables zero-shot robot manipulation, and present
a wide array of real-world robot manipulation results across unseen tasks,
objects, and scenes.
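
The step of inferring a rigid transform of the object from predicted point motion can be illustrated with a least-squares fit. The following is a minimal sketch, not the authors' implementation: it assumes the tracked points have already been lifted to 3D (for example with a depth image), which the abstract does not specify, and it uses the standard Kabsch algorithm as a stand-in solver. The function name `estimate_rigid_transform` is hypothetical.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ≈ src @ R.T + t.

    Assumption: src and dst are (N, 3) arrays of corresponding 3D points,
    e.g. object points at the current step and at a predicted future step.
    This is the Kabsch algorithm, used here only as an illustration.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Calling `estimate_rigid_transform(points_now, points_future)` once per predicted time-step would give a sequence of transforms. Mapping these to end-effector poses and refining them with the residual policy are separate steps that this sketch does not cover.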

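This work learns to predict plausible interaction plans from web videos and combines them with a task-agnostic transformation that yields robot actions in the real world, plus a closed-loop policy, trained with a few embodiment-specific demonstrations, that predicts residual actions. Combining scalably learned track prediction with this residual policy enables zero-shot robot manipulation, and the authors present a wide array of real-world results across unseen tasks, objects, and scenes.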

Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation

The use of anthropomorphic robotic hands for assisting individuals in
situations where human hands may be unavailable or unsuitable has gained
significant importance. In this paper, we propose a novel task called
human-assisting dexterous grasping that aims to train a policy for controlling
a robotic hand's fingers to assist users in grasping objects. Unlike
conventional dexterous grasping, this task presents a more complex challenge as
the policy needs to adapt to diverse user intentions, in addition to the
object's geometry. We address this challenge by proposing an approach
consisting of two sub-modules: a hand-object-conditional grasping primitive
called Grasping Gradient Field (GraspGF), and a history-conditional residual
policy. GraspGF learns 'how' to grasp by estimating the gradient from a set of
successful grasp examples, while the residual policy determines 'when' and at what
speed the grasping action should be executed based on the trajectory history.
Experimental results show that our proposed method outperforms the baselines,
highlighting its user-awareness and practicality in real-world applications.
The code and demonstrations can be viewed at "this https URL".

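This work proposes a new task, human-assisting dexterous grasping, which aims to train a policy that controls a robotic hand's fingers to assist users in grasping objects. It addresses the challenges of diverse user intentions and varied object geometry with the Grasping Gradient Field (GraspGF) and a history-conditional residual policy. Experiments show that the method outperforms the baselines and highlight its user-awareness and practicality in real-world applications.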
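
The split between a gradient field that says 'how' to grasp and a residual policy that sets 'when' and how fast can be sketched in a few lines. This is a toy illustration under stated assumptions, not the GraspGF model: the abstract does not say how the gradient is estimated, so a Gaussian kernel density over example poses stands in for the learned field, and a scalar `speed` in [0, 1] stands in for the residual policy's output. The names `grasp_score` and `step` are hypothetical.

```python
import numpy as np

def grasp_score(pose, success_poses, bandwidth=0.5):
    """Gradient of the log of a Gaussian kernel density over example poses.

    Assumption: poses are plain vectors; a real hand pose would also need
    conditioning on the object and the user's hand.
    """
    pose = np.asarray(pose, dtype=float)
    examples = np.asarray(success_poses, dtype=float)
    diffs = examples - pose                              # shape (N, D)
    log_w = -np.sum(diffs ** 2, axis=1) / (2.0 * bandwidth ** 2)
    w = np.exp(log_w - log_w.max())                      # numerically stable weights
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / bandwidth ** 2

def step(pose, success_poses, speed, step_size=0.05, bandwidth=0.5):
    """Move along the gradient field, scaled by a speed clipped to [0, 1]."""
    speed = float(np.clip(speed, 0.0, 1.0))
    direction = grasp_score(pose, success_poses, bandwidth)
    return np.asarray(pose, dtype=float) + speed * step_size * direction
```

With `speed=0.0` the pose stays put (the hand waits). A larger `speed` moves it toward the nearby successful examples more quickly, which is the role the abstract assigns to the history-conditional residual policy.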