Aligning image and text encoders from scratch using contrastive learning
requires large amounts of paired image-text data. We alleviate this need by
aligning individually pre-trained language and vision representation models
using a much smaller amount of paired data, augmented with a