Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited for learning robot manipulation policies compared to current state-of-the-art visual representations purely based on self-supervised objectives. We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the representation learning towards encoding meaningful structure for what's important for all perceptual skills, ultimately empowering learning of downstream robotic manipulation tasks. Extensive experiments across a range of robotic tasks and embodiments, in both simulations and real-world environments, show that our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders including R3M, MVP, and EgoVLP, for downstream manipulation policy-learning. Project page: https://sites.google.com/view/human-oriented-robot-learning

人类具有内在的通用视觉表征，使其能够高效地探索和与环境进行物体操控。本研究提出使用多任务微调的方式在经过预训练的视觉编码器上学习感知技能，通过任务融合解码器指导表示学习，使得对于所有感知技能来说，学习编码的结构能够更好地表示重要信息，最终为下游的机器人操控任务提供帮助。大量实验验证了任务融合解码器在多个机器人任务和仿真及现实环境中对于三种最先进的视觉编码器（R3M、MVP和EgoVLP）的表示进行了改进，提升了下游操控策略的学习性能。

面向人类的机器人操作的表示学习