Dominant scene text recognition models commonly contain two building blocks:
a visual model for feature extraction and a sequence model for text
transcription. This hybrid architecture, although accurate, is complex and less
efficient. In this study, we propose a Single Visual model for Scene Text
recognition within the patch-wise image tokenization framework, which dispenses
with the sequential modeling entirely. The method, termed SVTR, first
decomposes a text image into small patches called character components.
Hierarchical stages then repeatedly apply component-level mixing, merging,
and/or combining. Global and local mixing blocks are devised to
perceive the inter-character and intra-character patterns, leading to a
multi-grained character component perception. Thus, characters are recognized
by a simple linear prediction. Experimental results on both English and Chinese
scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L
(Large) achieves highly competitive accuracy in English and outperforms
existing methods by a large margin in Chinese, while running faster. In
addition, SVTR-T (Tiny) is an effective and much smaller model, which shows
appealing speed at inference. The code is publicly available.

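This study proposes a single visual model for scene text recognition within the patch-wise image tokenization framework. Through component-level mixing, merging, and/or combining, it uses global and local mixing blocks to perceive inter-character and intra-character patterns, so that characters are recognized by a simple linear prediction. Experimental results on English and Chinese scene text recognition tasks show that SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster.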
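
The pipeline described above (patch decomposition, local and global mixing, merging, combining, and linear prediction) can be illustrated with a minimal NumPy sketch. It is not the SVTR implementation, and everything specific in it is an assumption made for illustration: the toy image size, the 4x4 patch, the 3x3 local window, the parameter-free attention used for mixing, the averaging used for merging and combining, the random classifier weights, and all function names.

```python
import numpy as np


def patchify(image, patch=4):
    """Split an (H, W) image into non-overlapping patch x patch tokens.

    Returns an array of shape (H // patch, W // patch, patch * patch).
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    tiles = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(gh, gw, patch * patch)


def local_mask(gh, gw, win_h=3, win_w=3):
    """Boolean (N, N) mask letting each token see only a win_h x win_w window
    of neighbouring tokens centred on itself (N = gh * gw)."""
    rows, cols = np.divmod(np.arange(gh * gw), gw)
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])
    return (dr <= win_h // 2) & (dc <= win_w // 2)


def mix(tokens, mask=None):
    """Parameter-free self-attention over (N, D) tokens.

    mask=None mixes all tokens (global); a boolean mask restricts each token
    to its allowed neighbours (local). A real model would use learned
    query/key/value projections; they are omitted here for brevity.
    """
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens


def merge_height(grid):
    """Halve the grid height by averaging vertically adjacent tokens
    (a stand-in for a learned merging operation)."""
    gh, gw, d = grid.shape
    return grid[:gh - gh % 2].reshape(gh // 2, 2, gw, d).mean(axis=1)


rng = np.random.default_rng(0)
image = rng.random((32, 100))                    # toy grayscale text image
grid = patchify(image, patch=4)                  # (8, 25, 16) character components
gh, gw, d = grid.shape

tokens = grid.reshape(gh * gw, d)
tokens = mix(tokens, local_mask(gh, gw))         # local mixing: nearby components
tokens = mix(tokens)                             # global mixing: all components

grid = merge_height(tokens.reshape(gh, gw, d))   # merging: (4, 25, 16)
features = grid.mean(axis=0)                     # combining: collapse height, (25, 16)

num_classes = 37                                 # hypothetical alphabet size
classifier = rng.standard_normal((d, num_classes))  # untrained, random weights
logits = features @ classifier                   # linear prediction, (25, 37)
pred = logits.argmax(axis=1)                     # one class index per column
```

The only difference between the two mixing calls is the mask: the local variant limits each component to its spatial neighbours, while the global variant lets every component attend to every other one. No recurrent or sequence decoder appears anywhere; the final step is a single matrix product per column.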