We study how to generate captions that are not only accurate in describing an
image but also discriminative across different images. The problem is both
fundamental and interesting, as most machine-generated captions, despite
phenomenal research progress in the past several years, are expressed in a
very monotonous and featureless format. While such caption