We present an approach for recommending a music track for a given video, and
vice versa, based on both their temporal alignment and their correspondence at
an artistic level. We propose a self-supervised approach that learns this
correspondence directly from data, without any need for human annotations. In
order to capture the high-level concepts that are req