Abstract
Humans can describe image contents at whatever level of detail they wish, from coarse to fine. However, most
image captioning models are intention-agnostic and cannot actively generate diverse descriptions according to different user intentions. In this work, we propose the