Pre-trained vision-language models are the de facto foundation models for
various downstream tasks. However, this trend has not extended to the field of
scene text recognition (STR), despite the potential of CLIP to serve as a
powerful scene text reader. CLIP can robustly identify regular (horizontal) and
irregular (rotated, curved, blurred, or occluded) text in natural images. With
such merits, we introduce CLIP4STR, a simple yet effective STR method built
upon the image and text encoders of CLIP. It has two encoder-decoder branches: a
visual branch and a cross-modal branch. The visual branch provides an initial
prediction based on the visual feature, and the cross-modal branch refines this
prediction by addressing the discrepancy between the visual feature and text
semantics. To fully leverage the capabilities of both branches, we design a
dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new
state-of-the-art performance on 11 STR benchmarks. We also provide a
comprehensive empirical study to clarify how CLIP adapts to STR. We believe our
method establishes a simple but strong baseline for future STR research with
vision-language (VL) models.

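Summary: This paper introduces CLIP4STR, a simple yet effective scene text recognition method built on CLIP's image and text encoders, with a dual predict-and-refine decoding scheme. Experiments show that it achieves new state-of-the-art performance on 11 STR benchmarks.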
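
The data flow between the two branches, where the visual branch predicts and the cross-modal branch then refines using both the visual feature and the predicted text, can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than the CLIP4STR implementation: the function names, the list-of-floats "feature", and the lexicon lookup are assumptions chosen only to make the predict-then-refine idea runnable without a trained model. The actual method uses CLIP encoders with learned decoders, which the abstract does not detail.

```python
from typing import List, Sequence, Tuple


def visual_branch(feature: Sequence[float]) -> str:
    """Toy visual decoder (assumption): round each value to a letter index, 0 -> 'a'."""
    return "".join(chr(ord("a") + round(v) % 26) for v in feature)


def cross_modal_branch(
    feature: Sequence[float], initial: str, lexicon: List[str]
) -> str:
    """Toy refinement (assumption): pick the lexicon word that best agrees with
    the initial text, using the visual feature only to break ties."""

    def cost(word: str) -> Tuple[int, float]:
        # Disagreement with the text of the initial prediction.
        mismatches = sum(a != b for a, b in zip(word, initial))
        # Disagreement with the visual evidence.
        visual_gap = sum(
            abs(v - (ord(c) - ord("a"))) for v, c in zip(feature, word)
        )
        return (mismatches, visual_gap)

    candidates = [w for w in lexicon if len(w) == len(initial)]
    # With no plausible candidate, fall back to the visual prediction.
    return min(candidates, key=cost) if candidates else initial


def predict_and_refine(
    feature: Sequence[float], lexicon: List[str]
) -> Tuple[str, str]:
    """Return (initial visual prediction, refined cross-modal prediction)."""
    initial = visual_branch(feature)
    refined = cross_modal_branch(feature, initial, lexicon)
    return initial, refined


if __name__ == "__main__":
    # The last value is slightly off, so the visual branch reads 's' instead of 't'.
    print(predict_and_refine([2.0, 0.0, 18.4], ["cat", "dog", "cup"]))
    # ('cas', 'cat')
```

In this toy setting the refinement step corrects a visually plausible but semantically unlikely reading, which mirrors the stated role of the cross-modal branch in addressing the discrepancy between the visual feature and text semantics.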