Given an image and a target modification (e.g., an image of the Eiffel Tower and the text "without people and at night-time"), compositional image retrieval (CIR) aims to retrieve the relevant target image from a database. While supervised approaches rely on annotating triplets that is cos