Visual question answering (VQA) requires joint image and language understanding: the task is to answer a natural-language question about a given photograph. Recent approaches have applied deep image captioning methods based on recurrent LSTM (long short-term memory) networks to this problem, but have failed to model
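To make the baseline family described above concrete, the following is a minimal sketch of a CNN-plus-LSTM VQA model in PyTorch style. It is an illustrative assumption, not the model of any specific cited paper: all names (`CnnLstmVqaBaseline`, `img_feat_dim`, `hidden_dim`, and so on) and the elementwise fusion are hypothetical choices standing in for the common captioning-derived design in which a pretrained CNN supplies a single global image feature and an LSTM encodes the question.

```python
import torch
import torch.nn as nn


class CnnLstmVqaBaseline(nn.Module):
    """Hypothetical CNN+LSTM VQA baseline: encode the question with an
    LSTM, fuse it with a globally pooled CNN image feature, and classify
    over a fixed set of candidate answers."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=2048,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_feat_dim) pooled feature from a pretrained CNN
        # question_tokens: (B, T) padded word indices
        q_emb = self.embed(question_tokens)           # (B, T, embed_dim)
        _, (h_n, _) = self.lstm(q_emb)                # h_n: (1, B, hidden_dim)
        q_vec = h_n.squeeze(0)                        # final question state
        v_vec = torch.tanh(self.img_proj(img_feats))  # projected image feature
        fused = q_vec * v_vec                         # elementwise fusion
        return self.classifier(fused)                 # answer logits


# Usage with dummy inputs (dimensions are illustrative):
model = CnnLstmVqaBaseline(vocab_size=10000, num_answers=1000)
img_feats = torch.randn(4, 2048)                # e.g. pooled ResNet features
questions = torch.randint(1, 10000, (4, 12))    # batch of tokenized questions
logits = model(img_feats, questions)            # shape (4, 1000)
```

Note that in this design the image enters the model only as a single globally pooled vector, so the network has no mechanism for reasoning about particular spatial regions of the image when answering a question.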