Learned joint representations of images and text form the backbone of several
important cross-domain tasks such as image captioning. Prior work mostly maps
both domains into a common latent representation in a purely supervised
fashion. This is rather restrictive, however, as the two d