CogVideo: A Transformer-Based Large-Scale Pretrained Text-to-Video Generation Technique

May, 2022

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

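TL;DR: This work presents CogVideo, a 9B-parameter pretrained transformer trained by inheriting the pretrained text-to-image model CogView2. It also adopts a multi-frame-rate hierarchical training strategy to better align text and video clips. As possibly the first open-source large-scale pretrained text-to-video model, CogVideo outperforms publicly available models by a large margin in both machine and human evaluations.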

Abstract

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: The potentially huge computation cost makes training from scratch unaffordable; The scar