Despite exciting progress in pre-training for visual-linguistic (VL)
representations, very few works aspire to a small VL model. In this paper, we study
knowledge distillation (KD) to effectively compress a transformer-based large
VL model into a small VL model. The major challenge arises fr
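For context on the KD objective referenced above, the following is a minimal sketch of standard temperature-scaled logit distillation in PyTorch; the function name, temperature, and scaling are illustrative assumptions, not the specific method proposed in this paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KD: KL divergence between the teacher's and student's
    temperature-softened output distributions (illustrative sketch only)."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Multiply by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * (t ** 2)
```

In practice this term is typically combined with the task loss on ground-truth labels, weighted by a tunable coefficient.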