As an important step towards visual reasoning, visual grounding (e.g., phrase
localization, referring expression comprehension/segmentation) has been widely
explored. Previous approaches to referring expression comprehension (REC) or
segmentation (RES) either suffer from limited perform