In this paper, we propose a novel image captioning framework based on conditional generative adversarial nets, as an extension of the traditional reinforcement learning (RL) based encoder-decoder architecture. To deal with the inconsistency between objective language metrics and subjective human judgements, we design "discriminator" networks that automatically and progressively determine whether a generated caption is human described or machine generated. Two kinds of discriminator architecture (CNN-based and RNN-based) are introduced, since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing encoder-decoder based image captioning model, and we show that the conventional RL training method is just a special case of our framework. Empirically, we show consistent improvements over all language evaluation metrics for different state-of-the-art image captioning models.
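
The abstract's claim that conventional RL training is a special case of the framework can be illustrated with a small sketch. The abstract does not state the reward formula, so the convex blend below is an assumption made for illustration only, and the names `mixed_reward`, `disc_prob`, `metric_score`, and `lam` are hypothetical. Under that assumption, setting the blending weight to zero leaves only the language-metric reward, which is the usual RL objective for captioning.

```python
def mixed_reward(disc_prob: float, metric_score: float, lam: float) -> float:
    """Blend a discriminator score with a language-metric score.

    ASSUMPTION: the convex combination below is illustrative; the abstract
    does not give the exact reward used by the authors.

    disc_prob    -- discriminator's probability that the caption is human described
    metric_score -- objective language metric for the same caption
    lam          -- blending weight in [0, 1] (hypothetical hyperparameter)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * disc_prob + (1.0 - lam) * metric_score


# lam = 0 ignores the discriminator: the reward is the metric alone,
# i.e. metric-only RL training.
metric_only = mixed_reward(disc_prob=0.9, metric_score=0.4, lam=0.0)

# lam = 1 ignores the metric: the reward comes from the discriminator alone.
disc_only = mixed_reward(disc_prob=0.9, metric_score=0.4, lam=1.0)
```

Intermediate values of `lam` would trade off the two signals; how that weight is chosen is not specified in the abstract.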

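This paper proposes an image captioning framework based on conditional generative adversarial networks. It adds "discriminator" networks that progressively judge whether a generated caption was described by a human or generated by a machine. The algorithm is generic and can improve any existing RL-based image captioning framework, and experiments show consistent improvements across different language evaluation metrics.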

Improving Image Captioning with Conditional Generative Adversarial Nets