We introduce a new task, video-and-language inference, for joint multimodal
understanding of video and text. Given a video clip with aligned subtitles as
premise, paired with a natural language hypothesis based on the video content,
a model needs to infer whether the hypothesis is enta