Text-based person retrieval aims to find the query person based on a textual
description. The key is to learn a common latent space mapping between the
visual and textual modalities. To achieve this goal, existing works employ
segmentation to obtain explicit cross-modal alignments or utilize