Modeling and understanding time remains a challenge in contemporary video
understanding models. With language emerging as a key driver of powerful
generalization, it is imperative for foundational video-language models to have
a sense of time. In this paper, we consider a specific