Rapid progress is being made in developing large, pretrained, task-agnostic
foundation vision models such as CLIP, ALIGN, and DINOv2. In fact, we are
approaching the point where these models no longer need to be finetuned for
downstream tasks, and can instead be used zero-shot or with lightweight probing.