text-to-image person retrieval aims to identify the target person based on a
given textual description query. The primary challenge is to learn the mapping
of visual and textual modalities into a common latent space. Prior works have
attempted to address this challenge by leveraging se