Abstract
Existing attention mechanisms either attend to local image-grid features or to object-level features for visual question answering (VQA). Motivated by the observation that questions can relate to both object instances and their parts, we propose a novel