Sign Language Translation (SLT) is a challenging task due to its cross-domain
nature, involving the translation of visual-gestural language to text. Many
previous methods employ an intermediate representation, i.e., gloss sequences,
to facilitate SLT, thus transforming it into a two-stage task of sign language
recognition (SLR) followed by gloss-to-text translation. However, the
scarcity of gloss-annotated sign language data, combined with the information
bottleneck in the mid-level gloss representation, has hindered the further
development of the SLT task. To address this challenge, we propose a novel
Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), which improves
SLT by inheriting language-oriented prior knowledge from pre-trained models,
without any gloss annotation assistance. Our approach involves two stages: (i)
integrating Contrastive Language-Image Pre-training (CLIP) with masked
self-supervised learning to create pre-tasks that bridge the semantic gap
between visual and textual representations and restore masked sentences, and
(ii) constructing an end-to-end architecture with an encoder-decoder-like
structure that inherits the parameters of the pre-trained Visual Encoder and
Text Decoder from the first stage. The seamless combination of these novel
designs forms a robust sign language representation and significantly improves
gloss-free sign language translation. In particular, our approach improves the
BLEU-4 score by more than 5 points on the PHOENIX14T dataset and by more than 3
points on the CSL-Daily dataset compared to state-of-the-art gloss-free
SLT methods. Furthermore, our approach also achieves competitive results on the
PHOENIX14T dataset when compared with most of the gloss-based methods. Our code
is available at this https URL
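
To make the first-stage pre-task more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss between pooled sign-video embeddings and sentence embeddings. It is not taken from the released code: the function names, the temperature value of 0.07, and the use of NumPy are assumptions for illustration, and the masked-sentence reconstruction objective mentioned above is omitted.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(visual, text, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    visual, text: arrays of shape (batch, dim); row i of each is a matched pair.
    temperature: assumed value, not specified in the abstract.
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Video-to-text: each row should pick out its own column.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: each column should pick out its own row.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Matched pairs should score a lower loss than mismatched pairs.
    print(clip_style_loss(emb, emb) < clip_style_loss(emb, np.roll(emb, 1, axis=0)))
```

Minimizing this loss pulls each video embedding toward its paired sentence embedding and away from the other sentences in the batch, which is one common way to narrow the gap between visual and textual representations.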

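The Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP) method combines Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to build an end-to-end model. It improves the BLEU-4 score on the PHOENIX14T dataset by more than 5 points over state-of-the-art gloss-free SLT methods and achieves results competitive with most gloss-based methods.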