linguistic representations derived from text alone have been criticized for
their lack of grounding, i.e., connecting words to their meanings in the
physical world. Vision-and-Language (VL) models, trained jointly on text and
image or video data, have been offered as a response to such