Documents are two-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach in which textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered