Because of the rich dynamical structure of videos and their ubiquity in everyday life, it is a natural idea that video data could serve as a powerful unsupervised learning signal for training visual representations in deep neural networks. However, instantiating this idea, especially at large scale, has remained a significant artificial intelligence challenge. Here we present the Video Instance Embedding (VIE) framework, which extends powerful recent unsupervised loss functions for learning deep nonlinear embeddings to multi-stream temporal processing architectures on large-scale video datasets. We show that VIE-trained networks substantially advance the state of the art in unsupervised learning from video datastreams, both for action recognition in the Kinetics dataset, and object recognition in the ImageNet dataset. We show that a hybrid model with both static and dynamic processing pathways is optimal for both transfer tasks, and provide analyses indicating how the pathways differ. Taken in context, our results suggest that deep neural embeddings are a promising approach to unsupervised visual learning across a wide variety of domains.

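This paper introduces the Video Instance Embedding (VIE) framework, which extends powerful unsupervised loss functions for learning deep nonlinear embeddings to multi-stream temporal processing architectures on large-scale video datasets. It shows that VIE-trained networks make substantial advances in action recognition on the Kinetics dataset and object recognition on the ImageNet dataset, and it provides analyses indicating how the static and dynamic processing pathways differ.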

Applying deep neural embeddings to unsupervised learning from video