This paper strives to find, among a set of sentences, the one that best
describes the content of a given image or video. Unlike existing works, which
rely on a joint subspace for image and video caption retrieval, we
propose to do so in a visual space exclusively. Apart from t