Text-based image captioning (TextCap), which aims to read and reason about
images containing text, is crucial for a machine to understand detailed and
complex scenes, considering that text is omnipresent in daily life. This task,
however, is very challenging because an image often co