Make-up temporal video grounding (MTVG) aims to localize, within a long video,
the target segment that is semantically related to a sentence describing a
make-up activity. Compared with the general video grounding task,
MTVG focuses on meticulous actions and changes on the face. T