Neural image/video captioning models can generate accurate descriptions, but
their internal process of mapping regions to words is a black box and therefore
difficult to explain. Top-down neural saliency methods can find important
regions given a high-level semantic task such as object classification, but
cannot use a natural language sentence as the top-dow