End-to-end scene text spotting has attracted great attention in recent years
owing to the success of exploiting the intrinsic synergy between scene text
detection and recognition. However, recent state-of-the-art methods usually
incorporate detection and recognition simply by sharing the backbone, which
does not directly take advantage of the feature interaction between the two
tasks. In this paper, we propose a new end-to-end scene text spotting framework
termed SwinTextSpotter. Using a transformer encoder with a dynamic head as the
detector, we unify the two tasks with a novel Recognition Conversion mechanism
that explicitly guides text localization through the recognition loss. The
straightforward design results in a concise framework that requires neither
an additional rectification module nor character-level annotation for
arbitrarily-shaped text. Qualitative and quantitative experiments on
multi-oriented datasets RoIC13 and ICDAR 2015, arbitrarily-shaped datasets
Total-Text and CTW1500, and multi-lingual datasets ReCTS (Chinese) and VinText
(Vietnamese) demonstrate that SwinTextSpotter significantly outperforms existing
methods. Code is available at this https URL
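
To illustrate the idea of guiding localization through the recognition loss, the following is a minimal toy sketch and not the SwinTextSpotter implementation. It assumes, purely for illustration, that the detector emits per-location mask logits, that the recognizer reads features gated by the sigmoid of those logits, and that recognition is reduced to a scalar squared-error loss. All names (`recognition_loss`, `grad_wrt_mask_logits`, the sample values) are hypothetical. The point is only that, under such gating, the recognition loss has a nonzero gradient with respect to the detector's mask outputs.

```python
import math


def sigmoid(z):
    """Logistic function mapping a mask logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def recognition_loss(mask_logits, features, weights, target):
    """Toy recognition loss computed on mask-gated features.

    Hypothetical stand-in: a linear 'recognizer' scores the gated features
    and is penalized by squared error against a scalar target.
    """
    score = sum(w * sigmoid(z) * f
                for z, f, w in zip(mask_logits, features, weights))
    return (score - target) ** 2


def grad_wrt_mask_logits(mask_logits, features, weights, target):
    """Analytic gradient of the toy recognition loss w.r.t. the mask logits.

    Chain rule: dL/dz_i = 2 * (score - target) * w_i * f_i * m_i * (1 - m_i),
    where m_i = sigmoid(z_i).
    """
    masks = [sigmoid(z) for z in mask_logits]
    score = sum(w * m * f for m, f, w in zip(masks, features, weights))
    return [2.0 * (score - target) * w * f * m * (1.0 - m)
            for m, f, w in zip(masks, features, weights)]


# Illustrative values only (assumptions, not data from the paper).
mask_logits = [0.5, -1.0, 2.0]   # detector outputs
features = [1.0, 0.3, -0.7]      # shared features read by the recognizer
weights = [0.8, -0.5, 1.2]       # toy recognizer parameters
target = 1.0                     # toy recognition target

grads = grad_wrt_mask_logits(mask_logits, features, weights, target)
```

Each entry of `grads` is the signal the detector's mask prediction would receive from the recognition objective in this toy setting. In a real model the same coupling would be handled by automatic differentiation rather than a hand-derived gradient.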

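This paper proposes a new end-to-end scene text spotting framework built on a transformer encoder. Through a novel Recognition Conversion mechanism, it explicitly guides text localization without requiring an additional rectification module or character-level annotation, and it significantly outperforms existing methods on a variety of datasets.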