This paper attacks the challenging problem of video retrieval by text. In
such a retrieval paradigm, an end user searches for unlabeled videos with ad-hoc
queries, each expressed exclusively as a natural-language sentence, with
no visual example provided. Given videos as sequences