Despite considerable progress, state-of-the-art image captioning models
produce generic captions, leaving out important image details. Furthermore,
these systems may even misrepresent the image in order to produce a simpler
caption consisting of common concepts. In this paper, we first