Aesthetic image captioning (AIC) refers to the multi-modal task of generating critical textual feedback on photographs. While in natural image captioning (NIC) deep models are trained in an end-to-end manner using large curated datasets such as MS-COCO, no such large-scale, clean dataset exists for AIC. To address this gap, we propose an automatic cleaning strategy to create a benchmark AIC dataset by exploiting the images and noisy comments readily available from photography websites. We propose a probabilistic caption-filtering method for cleaning the noisy web data, and compile a large-scale, clean dataset, "AVA-Captions" (230,000 images with 5 captions per image). Additionally, by exploiting the latent associations between aesthetic attributes, we propose a strategy for training the convolutional neural network (CNN) based visual feature extractor, the first component of the AIC framework. The strategy is weakly supervised and can be used effectively to learn rich aesthetic representations without requiring expensive ground-truth annotations. Finally, we present a thorough analysis of the proposed contributions using automatic metrics and subjective evaluations.
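
The abstract does not spell out the filtering model, so the following is only a minimal sketch of what a probabilistic caption filter could look like. It assumes a unigram model estimated from the comment corpus itself and scores each comment by its average negative log-probability, so that generic remarks built from very frequent words score low and are discarded. The function names, the tokenizer, and the threshold are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_probs(comments):
    """Estimate unigram probabilities from the whole comment corpus."""
    counts = Counter(tok for c in comments for tok in tokenize(c))
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def informativeness(comment, probs):
    """Average negative log-probability of the comment's tokens.

    Rare words raise the score; very common words lower it.
    Tokens unseen in the corpus fall back to the smallest known probability.
    """
    toks = tokenize(comment)
    if not toks:
        return 0.0
    floor = min(probs.values())
    return -sum(math.log(probs.get(t, floor)) for t in toks) / len(toks)


def filter_comments(comments, threshold):
    """Keep comments whose score reaches the (hypothetical) threshold."""
    probs = unigram_probs(comments)
    return [c for c in comments if informativeness(c, probs) >= threshold]


if __name__ == "__main__":
    noisy = [
        "nice shot",
        "nice shot",
        "nice shot",
        "the shallow depth of field isolates the subject",
    ]
    for kept in filter_comments(noisy, threshold=2.0):
        print(kept)
```

In practice the threshold would have to be tuned on held-out comments, and a real system would likely use richer statistics than unigrams; this sketch only illustrates the idea of ranking comments by how unlikely, and hence how informative, their wording is.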

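This paper describes how to create a benchmark dataset for aesthetic image captioning (AVA-Captions) using an automatic cleaning strategy based on the images and noisy comments available from websites. It also introduces a probabilistic caption-filtering method, together with a strategy that exploits the latent associations between aesthetic attributes to train the convolutional neural network (CNN) feature extractor. The strategy is weakly supervised and can be used to learn rich aesthetic representations without expensive annotations. Finally, the paper presents a thorough analysis of the proposed contributions using automatic metrics and subjective evaluations.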

Aesthetic Image Captioning from Weakly-Labelled Photographs