pretrained vision-language models have shown effectiveness in video
understanding. However, recent studies have not sufficiently leveraged
essential temporal information from videos, simply averaging frame-wise
representations or referencing consecutive frames. We introduce Temporally