Recent advances in training vision-language models have demonstrated
unprecedented robustness and transfer learning effectiveness; however, standard
computer vision datasets are image-only, and therefore not well adapted to such
training methods. Our paper introduces a simple methodolo