Inspired by speech recognition, recent state-of-the-art algorithms mostly
consider scene text recognition as a sequence prediction problem. Though
achieving excellent performance, these methods usually neglect the important
fact that text in images is actually distributed in two-dimens