Training an effective video-and-language model intuitively requires multiple
frames as model inputs. However, it is unclear whether using multiple frames is
beneficial to downstream tasks, and if so, whether the performance gain is
worth the drastically increased computation and memor