Pre-trained vision-language models are the de facto foundation models for
various downstream tasks. However, this trend has not extended to the field of
scene text recognition (STR), despite the potential of CLIP to serve as a
powerful scene text reader. CLIP can robustly identify regular (horizontal) and
irregular (rotated, curved, blurred, or occluded) text in natural images. With
such merits, we introduce CLIP4STR, a simple yet effective STR method built
upon the image and text encoders of CLIP. It has two encoder-decoder branches: a
visual branch and a cross-modal branch. The visual branch provides an initial
prediction based on the visual feature, and the cross-modal branch refines this
prediction by addressing the discrepancy between the visual feature and text
semantics. To fully leverage the capabilities of both branches, we design a
dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new
state-of-the-art performance on 11 STR benchmarks. Additionally, a
comprehensive empirical study is provided to enhance the understanding of the
adaptation of CLIP to STR. We believe our method establishes a simple but
strong baseline for future STR research with VL models.
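
The two-stage flow described above (an initial prediction from the visual branch, then a refinement by the cross-modal branch) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the feature type, and the toy branches are hypothetical stand-ins, not the CLIP4STR implementation, and the sketch does not attempt to reproduce the details of the dual decoding scheme.

```python
from typing import Callable, Sequence, Tuple

Feature = Sequence[float]  # hypothetical placeholder for a visual feature


def predict_and_refine(
    image_feature: Feature,
    visual_branch: Callable[[Feature], str],
    cross_modal_branch: Callable[[Feature, str], str],
) -> Tuple[str, str]:
    """Return (initial, refined) text predictions for one image feature."""
    # Stage 1: the visual branch predicts text from the visual feature alone.
    initial = visual_branch(image_feature)
    # Stage 2: the cross-modal branch sees both the visual feature and the
    # initial text, and returns a corrected prediction.
    refined = cross_modal_branch(image_feature, initial)
    return initial, refined


# Toy stand-ins (assumptions): a "visual" reader that confuses 'l' with '1',
# and a "cross-modal" refiner that repairs that confusion.
def toy_visual_branch(feature: Feature) -> str:
    return "he1lo"


def toy_cross_modal_branch(feature: Feature, text: str) -> str:
    return text.replace("1", "l")


initial, refined = predict_and_refine(
    [0.1, 0.2], toy_visual_branch, toy_cross_modal_branch
)
```

In a real model both branches would be learned decoders; the point of the sketch is only that the refinement step is conditioned on the visual feature and on the first prediction.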

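Introduces CLIP4STR, a simple yet effective scene text recognition method based on CLIP. It is built on CLIP's image and text encoders and uses a dual predict-and-refine decoding scheme. Experiments show that the method achieves new state-of-the-art performance on 11 STR benchmarks.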

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

We propose a novel non-parametric method for cross-modal recipe retrieval
that is applied on top of precomputed image and text embeddings. By combining
our method with standard approaches for building image and text encoders,
trained independently with a self-supervised classification objective, we
create a baseline model which outperforms most existing methods on a
challenging image-to-recipe task. We also use our method for comparing image
and text encoders trained using different modern approaches, thus addressing
the issues hindering the development of novel methods for cross-modal recipe
retrieval. We demonstrate how to use the insights from model comparison and
extend our baseline model with a standard triplet loss, which improves the
state of the art on the Recipe1M dataset by a large margin while using only
precomputed features and with much less complexity than existing methods.
Further, our approach readily generalizes beyond recipe retrieval to other
challenging domains, achieving state-of-the-art performance on Politics and
GoodNews cross-modal retrieval tasks.
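
As a rough illustration of the two ingredients named above (retrieval over precomputed embeddings and a standard triplet loss), here is a minimal sketch. It is not the paper's non-parametric method, which the abstract does not specify. The cosine similarity, the margin value of 0.3, the function names, and the toy vectors are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.3,  # assumed value, not taken from the source
) -> float:
    """Hinge-style triplet loss on cosine similarity.

    The loss is zero once the anchor is more similar to the positive than
    to the negative by at least `margin`.
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def retrieve(query: Sequence[float], candidates: List[Sequence[float]]) -> int:
    """Return the index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


# Toy precomputed embeddings (assumptions): one image and two recipes.
image_embedding = [1.0, 0.0]
recipe_embeddings = [[0.0, 1.0], [1.0, 0.0]]

best_index = retrieve(image_embedding, recipe_embeddings)
# A well-separated triplet incurs no loss; a swapped one is penalized.
good_loss = triplet_loss(image_embedding, recipe_embeddings[1], recipe_embeddings[0])
bad_loss = triplet_loss(image_embedding, recipe_embeddings[0], recipe_embeddings[1])
```

Because the embeddings are precomputed, both functions operate on plain vectors and never call an encoder, which is what keeps this kind of pipeline cheap compared with end-to-end training.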

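We propose a novel non-parametric method for cross-modal recipe retrieval that operates on top of precomputed image and text embeddings. By combining our method with standard image and text encoders trained independently with a self-supervised classification objective, we create a baseline model that outperforms most existing methods on a challenging image-to-recipe task. We also use our method to compare image and text encoders trained with different modern approaches, thereby addressing issues that hinder progress in cross-modal recipe retrieval. Extending the baseline with a triplet loss, while using only precomputed features and remaining simpler than existing methods, improves the state of the art on the Recipe1M dataset by a large margin. The approach also generalizes readily to other challenging domains, achieving state-of-the-art performance on the Politics and GoodNews cross-modal retrieval tasks.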