Questions that require counting a variety of objects in images remain a major
challenge in visual question answering (VQA). The most common approaches to VQA
involve either classifying answers based on fixed length representations of
both the image and question or summing fractional co