Recent advances in the diffusion models have significantly improved
text-to-image generation. However, generating videos from text is a more
challenging task than generating images from text, due to the much larger
dataset and higher computational cost required. Most existing video gen