In this paper, we consider the problem of learning a visual representation from the raw spatiotemporal signals in videos for use in action recognition. Our representation is learned without supervision from semantic labels. We formulate the learning problem as an unsupervised sequential verification tas