The predominant approach to visual question answering (VQA) demands that the
model represent within its weights all of the information required to answer
any question about any image. Learning this information from any real training
set seems unlikely, and representing it in a reasona