tireid aims to retrieve the image corresponding to the given text query from
a pool of candidate images. Existing methods employ prior knowledge from
single-modality pre-training to facilitate learning, but lack multi-modal
correspondences. Besides, due to the substantial gap between m