We investigate the problem of video referring expression comprehension (REC),
which aims to localize the referent objects described in the sentence to visual
regions in the video frames. Despite the recent progress, existing methods
suffer from two problems: 1) inconsistent localizatio