An image caption should fluently present the essential information in a given
image, including informative, fine-grained entity mentions and the manner in
which these entities interact. However, current captioning models are usually
trained to generate captions that only contain common object names, thus
falling short on an important "informativeness" dimens