Abstract

To answer semantically complicated questions about an image, a
visual question answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a