While large-scale robotic systems typically rely on textual instructions for
tasks, this work explores a different approach: can robots infer the task
directly from observing humans? This shift necessitates the robot's ability to
decode human intent and translate it into executable actions within its
physical constraints and environment. We introduce Vid2Robot, a novel
end-to-end video-based learning framework for robots. Given a video
demonstration of a manipulation task and current visual observations, Vid2Robot
directly produces robot actions. This is achieved through a unified
representation model trained on a large dataset of human videos and robot
trajectories. The model leverages cross-attention mechanisms to fuse prompt video
features with the robot's current state and generate appropriate actions that
mimic the observed task. To further improve policy performance, we propose
auxiliary contrastive losses that enhance the alignment between human and robot
video representations. We evaluate Vid2Robot on real-world robots,
demonstrating a 20% improvement in performance compared to other
video-conditioned policies when using human demonstration videos. Additionally,
our model exhibits emergent capabilities, such as successfully transferring
observed motions from one object to another, and long-horizon composition, thus
showcasing its potential for real-world applications. Project website:
vid2robot.github.io
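
To make the cross-attention fusion step concrete, the following is a minimal sketch and not the Vid2Robot implementation. It assumes single-head attention with no learned projections, where tokens from the robot's current observation act as queries and prompt-video tokens act as keys and values. The names `cross_attention`, `state_tokens`, and `prompt_tokens` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(state_tokens, prompt_tokens):
    """Single-head cross-attention without learned projections (assumption).

    state_tokens: queries, one feature vector per robot-observation token.
    prompt_tokens: keys and values, one feature vector per prompt-video token.
    Returns one fused feature vector per state token.
    """
    dim = len(prompt_tokens[0])
    fused = []
    for query in state_tokens:
        # Scaled dot-product score between this query and every prompt token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in prompt_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the prompt tokens, used here as values.
        fused.append([
            sum(w * value[i] for w, value in zip(weights, prompt_tokens))
            for i in range(dim)
        ])
    return fused


# Toy data: three prompt-video tokens and two robot-state tokens.
prompt_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state_tokens = [[2.0, 0.0], [0.0, 0.0]]
fused = cross_attention(state_tokens, prompt_tokens)
```

In this toy setup, a state token that aligns with the first feature dimension attends more to prompt tokens that are active in that dimension, while an all-zero state token attends uniformly. A real policy would add learned query, key, and value projections, multiple heads, and an action decoder on top of the fused features.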

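Vid2Robot is a video-based learning framework in which a robot observes human behavior and translates it into executable actions. The model is trained on a dataset of human videos and robot trajectories. It uses cross-attention mechanisms to fuse prompt video features with the robot's current state and generates actions that mimic the observed task, substantially improving task performance and showing potential for real-world applications.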

Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

In this paper, we present a video-based learning framework for animating
personalized 3D talking faces from audio. We introduce two training-time data
normalizations that significantly improve data sample efficiency. First, we
isolate and represent faces in a normalized space that decouples 3D geometry,
head pose, and texture. This decomposes the prediction problem into regressions
over the 3D face shape and the corresponding 2D texture atlas. Second, we
leverage facial symmetry and approximate albedo constancy of skin to isolate
and remove spatio-temporal lighting variations. Together, these normalizations
allow simple networks to generate high fidelity lip-sync videos under novel
ambient illumination while training with just a single speaker-specific video.
Further, to stabilize temporal dynamics, we introduce an auto-regressive
approach that conditions the model on its previous visual state. Human ratings
and objective metrics demonstrate that our method outperforms contemporary
state-of-the-art audio-driven video reenactment benchmarks in terms of realism,
lip-sync and visual quality scores. We illustrate several applications enabled
by our framework.
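
The auto-regressive conditioning described above can be illustrated with a small sketch. This is not the authors' network: `rollout` and `toy_predictor` are hypothetical names, and the predictor is a stand-in that simply blends the audio-driven target with the previous visual state. It only shows how feeding each output back into the next step couples consecutive frames.

```python
def rollout(audio_features, predict_frame, initial_state):
    """Generate one visual state per audio frame, feeding each output back in."""
    states = []
    previous = initial_state
    for audio in audio_features:
        # The predictor sees the current audio and the previous visual state.
        current = predict_frame(audio, previous)
        states.append(current)
        previous = current
    return states


def toy_predictor(audio, previous, smoothing=0.5):
    """Hypothetical stand-in for a trained model (assumption).

    Blends the audio-driven target with the previous state, so the output
    cannot jump arbitrarily far between consecutive frames.
    """
    return [smoothing * p + (1.0 - smoothing) * a for a, p in zip(audio, previous)]


# Toy data: one-dimensional "audio" targets for three frames.
audio_features = [[1.0], [1.0], [0.0]]
states = rollout(audio_features, toy_predictor, initial_state=[0.0])
```

With the toy predictor, the state moves toward each audio target gradually instead of jumping to it, which is the kind of temporal smoothing that conditioning on the previous state makes possible.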

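This paper presents a video-based learning framework for animating personalized 3D talking faces from audio. It uses face normalization and an auto-regressive approach to improve sample efficiency and generate high-fidelity lip-sync videos.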

LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization

Temporal cues in videos provide important information for recognizing actions
accurately. However, temporal-discriminative features can hardly be extracted
without using an annotated large-scale video action dataset for training. This
paper proposes a novel Video-based Temporal-Discriminative Learning (VTDL)
framework in a self-supervised manner. Without labelled data for network
pretraining, a temporal triplet is generated for each anchor video by using
segments of the same or different time intervals so as to enhance the capacity
for temporal feature representation. Measuring temporal information by time
derivative, Temporal Consistent Augmentation (TCA) is designed to ensure that
the time derivative (in any order) of the augmented positive is invariant
except for a scaling constant. Finally, temporal-discriminative features are
learnt by minimizing the distance between each anchor and its augmented
positive, while the distance between each anchor and its augmented negative as
well as other videos saved in the memory bank is maximized to enrich the
representation diversity. In the downstream action recognition task, the
proposed method significantly outperforms existing related works. Surprisingly,
the proposed self-supervised approach is better than fully-supervised methods
on UCF101 and HMDB51 when a small-scale video dataset (with only thousands of
videos) is used for pre-training. The code has been made publicly available on
this https URL
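
The invariance property stated for Temporal Consistent Augmentation can be checked numerically on a toy example. The sketch below is an illustration under assumptions, not the released code: it uses an affine intensity change applied identically to every frame as one example of an augmentation whose time derivative, of any order, equals the original derivative times a constant. The names `time_derivative` and `consistent_augment` are hypothetical.

```python
def time_derivative(frames, order=1):
    """Finite-difference time derivative of a video.

    frames: list of frames, each a flat list of pixel intensities.
    order: how many times to difference along the time axis.
    """
    for _ in range(order):
        frames = [
            [b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])
        ]
    return frames


def consistent_augment(frames, scale, offset):
    """Apply the same affine intensity change to every frame (assumed example)."""
    return [[scale * p + offset for p in frame] for frame in frames]


# Toy video: four frames with two pixels each.
video = [[0.0, 1.0], [1.0, 3.0], [4.0, 2.0], [9.0, 0.0]]
augmented = consistent_augment(video, scale=2.0, offset=5.0)

# The constant offset cancels under differencing, so only the scale remains.
first_order = time_derivative(augmented, order=1)
second_order = time_derivative(augmented, order=2)
```

Because the offset is identical across frames, it cancels in every finite difference, so each derivative of the augmented clip is the original derivative multiplied by `scale`. An augmentation whose parameters change from frame to frame would not have this property.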

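This study proposes VTDL, a new video-based self-supervised learning framework that generates temporal triplets to strengthen temporal feature representation and uses Temporal Consistent Augmentation (TCA), which keeps the time derivative of the augmented positive invariant up to a scaling constant. The method significantly outperforms related work on action recognition, and it surpasses fully-supervised methods when a small-scale video dataset is used for pre-training.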