We present a framework for assistive robot manipulation that addresses two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge with a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. We then learn robot trajectories guided by these affordances with a supervised flow matching method. Flow matching represents a robot visuomotor policy as a conditional process that flows random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation shows that the proposed prompt tuning method for learning manipulation affordances with a language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while remaining parameter-efficient. Learning multi-task robot trajectories with a single flow matching policy also leads to consistently better performance than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
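
The idea of flowing random waypoints to a desired trajectory can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the common straight-line interpolation path between noise and data, stands in a closed-form oracle velocity for the learned network, and leaves out the affordance conditioning entirely. All names (`interpolate`, `target_velocity`, `fm_loss`, `generate`, `demo_traj`) are hypothetical.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to trajectory x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target for a velocity network under the straight-line path (assumed)."""
    return x1 - x0


def fm_loss(pred_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    return float(np.mean((pred_velocity - target_velocity(x0, x1)) ** 2))


def generate(velocity_fn, x0, steps=10):
    """Euler-integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Hypothetical demonstration: 8 waypoints in 2D.
demo_traj = np.stack([np.linspace(0.0, 1.0, 8), np.linspace(0.0, 0.5, 8)], axis=1)

# Random waypoints that the flow transports toward the demonstration.
rng = np.random.default_rng(0)
noise = rng.standard_normal(demo_traj.shape)


def oracle_velocity(x, t):
    """Closed-form velocity toward a single known target; a trained, conditioned
    network would take its place in practice."""
    return (demo_traj - x) / (1.0 - t)


generated = generate(oracle_velocity, noise, steps=10)
```

In a trained policy, the oracle would be replaced by a network that receives the current waypoints, the time `t`, and the conditioning signal, and is fit by minimizing `fm_loss` at points sampled with `interpolate`.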

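This work addresses two key challenges in assistive robot manipulation: how to efficiently adapt large-scale models to affordance understanding tasks in real-world scenes, and how to guide the learning of robot trajectories with a visual affordance model. We propose an efficient prompt tuning method that enables a frozen vision model to predict manipulation affordances in multi-task scenarios, and we use flow matching to learn robot trajectories. The approach performs well on Activities of Daily Living tasks, achieving competitive performance and even outperforming other finetuning protocols while remaining parameter-efficient.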

Affordance-based Robot Manipulation with Flow Matching