Abstract

How does a machine learn to reason about the content of a video when answering a question? A
video QA system must simultaneously understand language, represent visual content over space-time, and iteratively transform these representations in response to