Current captioning datasets focus on object-centric captions that describe the
visible objects in an image, often stating what is obvious to a human observer,
e.g. "people eating food in a park". Although these datasets are useful for
evaluating the ability of Vision & Language models to r