Learning Video Representations from Textual Web Supervision

Jul, 2020

Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar...

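TL;DR: By collecting 70M publicly available videos and training with self-supervision on their associated text descriptions, this paper proposes a text-based method for learning video representations and shows that it is more effective than existing methods for pretraining video representations.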

Abstract

Videos found on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use such text as a method for learning →