While image manipulation has achieved tremendous breakthroughs (e.g., generating
realistic faces) in recent years, video generation remains much less explored and
harder to control, which limits its real-world applications. For
instance, video editing requires temporal coherence across m