Videos on the Internet are paired with pieces of text, such as titles and
descriptions. This text typically describes the most important content in the
video, such as the objects in the scene and the actions being performed. Based
on this observation, we propose to use text as a method for learning video
representations. To accomplish this, we propose a data