Suppose that we are given a set of videos, along with natural language
descriptions in the form of multiple sentences (e.g., manual annotations, movie
scripts, sport summaries etc.), and that these sentences appear in the same
temporal order as their visual counterparts. We propose in