Scene text image super-resolution (STISR), aiming to improve image quality
while boosting downstream scene text recognition accuracy, has recently
achieved great success. However, most existing methods treat the foreground
(character regions) and background (non-character regions) equally in the
forward process, and neglect the disturbance from the complex background, thus
limiting the performance. To address these issues, we propose LEMMA, a novel
method that explicitly models character regions to produce high-level
text-specific guidance for super-resolution. To model the location of
characters effectively, we propose a location enhancement module that
extracts character region features based on the attention map sequence. In
addition, we propose a multi-modal alignment module that performs bidirectional
visual-semantic alignment to generate high-quality prior guidance, which is
then incorporated into the super-resolution branch in an adaptive manner using
the proposed adaptive fusion module. Experiments on TextZoom and four scene
text recognition benchmarks demonstrate the superiority of our method over
other state-of-the-art methods. Code is available.
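
The location enhancement step described above can be illustrated with a small sketch. It is not the authors' implementation: it only assumes that a recognizer yields one attention map per character, that values are normalized to [0, 1], and that a pixel counts as character foreground when any map exceeds a threshold. The function names and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

AttentionMap = List[List[float]]        # one H x W map per decoded character
FeatureMap = List[List[List[float]]]    # H x W grid of feature vectors


def foreground_mask(attn_seq: List[AttentionMap],
                    threshold: float = 0.5) -> List[List[bool]]:
    """Mark a pixel as foreground if any per-character map exceeds the threshold.

    Assumption: attention values lie in [0, 1]; the threshold is illustrative.
    """
    height, width = len(attn_seq[0]), len(attn_seq[0][0])
    mask = [[False] * width for _ in range(height)]
    for attn in attn_seq:
        for y in range(height):
            for x in range(width):
                if attn[y][x] > threshold:
                    mask[y][x] = True
    return mask


def select_character_features(
    features: FeatureMap, mask: List[List[bool]]
) -> List[Tuple[Tuple[int, int], List[float]]]:
    """Keep only the feature vectors at foreground positions, in row-major order."""
    selected = []
    for y, row in enumerate(mask):
        for x, is_foreground in enumerate(row):
            if is_foreground:
                selected.append(((y, x), features[y][x]))
    return selected
```

Under these assumptions, background positions are dropped before any later alignment step, so the text-specific guidance is computed from character regions only.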

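This work proposes a new method named LEMMA that explicitly models character regions to generate high-level text-specific guidance for super-resolution. A location enhancement module and a multi-modal alignment module improve character-region feature extraction and visual-semantic alignment, and an adaptive fusion module fuses the prior guidance into the super-resolution branch. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the method's superiority over other state-of-the-art methods.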

Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement

The current mainstream vision-language (VL) tracking framework consists of
three parts, i.e., a visual feature extractor, a language feature extractor,
and a fusion model. To pursue better performance, a natural modus operandi for
VL tracking is to employ customized, heavier unimodal encoders and multi-modal
fusion models. Albeit effective, existing VL trackers separate feature
extraction and feature integration, resulting in extracted features that lack
semantic guidance and have limited target-aware capability in complex
scenarios, e.g., similar distractors and extreme illumination. In this work,
inspired by the recent success of exploring foundation models with unified
architecture for both natural language and computer vision tasks, we propose an
All-in-One framework, which learns joint feature extraction and interaction by
adopting a unified transformer backbone. Specifically, we mix raw vision and
language signals to generate language-injected vision tokens, which we then
concatenate before feeding into the unified backbone architecture. This
approach achieves feature integration in a unified backbone, removing the need
for carefully designed fusion modules and resulting in a more effective and
efficient VL tracking framework. To further improve the learning efficiency, we
introduce a multi-modal alignment module based on cross-modal and intra-modal
contrastive objectives, providing more reasonable representations for the
unified All-in-One transformer backbone. Extensive experiments on five
benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOT$_{\rm Ext}$, and WebUAV-3M,
demonstrate the superiority of the proposed tracker over existing
state-of-the-art VL trackers. Code will be made publicly available.
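
The abstract does not give the exact form of the cross-modal and intra-modal contrastive objectives. As a rough illustration only, the sketch below uses a symmetric InfoNCE loss, a common choice for this kind of alignment. The function names and the temperature of 0.07 are assumptions, not the authors' formulation. Row i of each input is assumed to be a non-zero embedding, and row i of `a` is assumed to pair with row i of `b`.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(a: Sequence[Vector], b: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: matched pairs (a[i], b[i]) are pulled together and
    all other pairs in the batch are pushed apart."""
    a_unit = [_normalize(v) for v in a]
    b_unit = [_normalize(v) for v in b]
    # Cosine similarity between every a[i] and b[j], scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b_unit]
              for u in a_unit]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))
```

In this sketch a cross-modal term would be `info_nce(vision_embeddings, language_embeddings)`, and an intra-modal term would apply the same function to two views of one modality. Both variable names are placeholders.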

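The current mainstream vision-language (VL) tracking framework consists of three parts: a visual feature extractor, a language feature extractor, and a fusion model. This paper proposes a new All-in-One framework that adopts a unified Transformer backbone to learn joint feature extraction and interaction. Feature integration happens inside the backbone, which removes the need for separate fusion modules and yields more effective and efficient VL tracking.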