In this paper, we study the problem of temporal video grounding (TVG), which
aims to predict the starting/ending time points of moments described by a text
sentence within a long untrimmed video. Benefiting from fine-grained 3D visual
features, the TVG techniques have achieved remarkab