Large-scale pre-trained models (PTMs) exhibit strong zero-shot capabilities. In
this paper, we study how to leverage them for zero-shot visual question
answering (VQA). Our approach is motivated by a few observations. First, VQA
questions often require multiple steps of reasoning, which is