In this paper, we abandon the dominant complex language model and rethink the linguistic learning process in scene text recognition. Unlike previous methods, which handle visual and linguistic information in two separate structures, we propose a Visual Language Modeling Network (VisionLAN), which treats visual and linguistic information as a union by directly endowing the vision model with language capability. Specifically, we introduce text recognition on character-wise occluded feature maps in the training stage. This operation guides the vision model to use not only the visual texture of characters but also the linguistic information in the visual context for recognition when the visual cues are confused (e.g., occlusion, noise). Because the linguistic information is acquired along with the visual features without an extra language model, VisionLAN significantly improves speed by 39% and adaptively considers linguistic information to enhance the visual features for accurate recognition. Furthermore, we propose an Occlusion Scene Text (OST) dataset to evaluate performance when character-wise visual cues are missing. State-of-the-art results on several benchmarks demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/wangyuxin87/VisionLAN.

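This paper proposes VisionLAN, a model that improves both the speed and the accuracy of text recognition. It uses Visual Language Modeling, which combines visual and linguistic information, to directly endow the vision model with language capability. During training, the vision model is thereby guided to use both visual text features and the linguistic information in the context to identify characters, working around interference such as visual noise. The Occlusion Scene Text dataset introduced in this paper, which consists of data with part of the character information missing, further verifies the effectiveness of the model in complex environments.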
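To make the idea of character-wise occlusion concrete, the following is a minimal sketch of zeroing out the part of a feature map that covers one character while the recognition target stays the full word. It is not the authors' implementation: the function name `occlude_character`, the feature-map shape, and above all the assumption that characters are evenly spaced along the width are hypothetical simplifications made only for illustration.

```python
import numpy as np


def occlude_character(features, num_chars, char_index):
    """Zero out the horizontal slice of a feature map assumed to cover one character.

    features:   array of shape (channels, height, width).
    num_chars:  number of characters in the word label.
    char_index: 0-based index of the character to occlude.
    Returns the occluded feature map and the (start, end) columns that were zeroed.
    """
    if not 0 <= char_index < num_chars:
        raise ValueError("char_index must satisfy 0 <= char_index < num_chars")
    _, _, width = features.shape
    # Hypothetical assumption: characters are evenly spaced along the width.
    # A real system would need a learned or annotated character location.
    start = (char_index * width) // num_chars
    end = ((char_index + 1) * width) // num_chars
    # Mask is 1 everywhere except over the chosen character's columns.
    mask = np.ones((1, 1, width), dtype=features.dtype)
    mask[..., start:end] = 0
    # Broadcasting applies the same column mask to every channel and row.
    return features * mask, (start, end)


# Example: occlude one randomly chosen character of the label "text".
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 32))  # stand-in for backbone features
label = "text"
char_index = int(rng.integers(len(label)))
occluded, (start, end) = occlude_character(features, len(label), char_index)
# The training target would remain the full label, so the recognizer has to
# infer the hidden character from the surrounding visual context.
```

In such a setup the occluded features would be fed to the recognizer only during training; at inference time the unmodified features are used, which is consistent with the claim that no extra language model is needed.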

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network