This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around $K$ components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture Model (GMM) prior, while the second defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than those of a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
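
To make the difference between the two priors concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes per-image weights over the $K$ components that sum to one and a single shared standard deviation `sigma`; neither detail is specified above. The function names (`sample_gmm_prior`, `ag_prior_mean`, `sample_ag_prior`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_gmm_prior(weights, means, sigma, rng):
    """Mixture prior (assumed form): pick one component k with
    probability weights[k], then draw from N(means[k], sigma^2 I)."""
    k = rng.choice(len(weights), p=weights)
    return means[k] + sigma * rng.standard_normal(means.shape[1])


def ag_prior_mean(weights, means):
    """Additive prior mean: a linear combination of the K component means."""
    return weights @ means


def sample_ag_prior(weights, means, sigma, rng):
    """Additive prior (assumed form): a single Gaussian centred on the
    weighted sum of component means."""
    mean = ag_prior_mean(weights, means)
    return mean + sigma * rng.standard_normal(means.shape[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative values only: K = 3 components in a 2-D latent space.
    means = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
    weights = np.array([0.5, 0.5, 0.0])  # image contains content types 0 and 1
    print("GMM sample:", sample_gmm_prior(weights, means, 0.1, rng))
    print("AG sample: ", sample_ag_prior(weights, means, 0.1, rng))
```

Under these assumptions, a mixture sample lands near exactly one component mean, whereas an additive sample lands near a point between the means of all content types present in the image.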

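This paper explores image caption generation using conditional variational auto-encoders (CVAEs). We propose two models that structure the latent space using, respectively, a Gaussian Mixture Model (GMM) prior and a novel Additive Gaussian (AG) prior that linearly combines component means, thereby creating priors for images that contain multiple types of content. We show that both models produce more diverse and more accurate captions than an LSTM baseline or a "vanilla" CVAE, with AG-CVAE performing particularly well.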

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space