Most existing scene text detectors focus on detecting characters or words
that only capture partial text messages due to missing contextual information.
For a better understanding of text in scenes, it is more desired to detect
contextual text blocks (CTBs) which consist of one or mult