image captioning has emerged as an interesting research field in recent years
due to its broad application scenarios. The traditional paradigm of image
captioning relies on paired image-caption datasets to train the model in a
supervised manner. However, creating such paired datasets f