Sign language recognition (SLR) plays a vital role in facilitating
communication for the hearing-impaired community. SLR is a weakly supervised
task where entire videos are annotated with glosses, making it challenging to
identify the corresponding gloss within a video segment. Recent studies
indicate that the main bottleneck in SLR is insufficient training caused by
the limited availability of large-scale datasets. To address this challenge, we
present SignVTCL, a multi-modal continuous sign language recognition framework
enhanced by visual-textual contrastive learning, which leverages the full
potential of multi-modal data and the generalization ability of language models.
SignVTCL integrates multi-modal data (video, keypoints, and optical flow)
simultaneously to train a unified visual backbone, thereby yielding more robust
visual representations. Furthermore, SignVTCL contains a visual-textual
alignment approach incorporating gloss-level and sentence-level alignment to
ensure precise correspondence between visual features and glosses at both the
individual gloss level and the sentence level. Experimental results conducted on three
datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL
achieves state-of-the-art results compared with previous methods.
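
To make the idea of visual-textual contrastive alignment concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss between paired visual and textual embeddings. It is not the SignVTCL implementation: the function names, the cosine-similarity scoring, and the temperature value of 0.07 are all assumptions chosen for illustration, and the abstract does not specify the exact loss used at the gloss or sentence level.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(visual, textual, temperature):
    """Pairwise cosine similarities between every visual and textual embedding,
    divided by the temperature (hypothetical hyperparameter)."""
    v = [l2_normalize(row) for row in visual]
    t = [l2_normalize(row) for row in textual]
    return [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i,
    i.e. the i-th visual embedding is paired with the i-th text embedding."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(visual, textual, temperature=0.07):
    """Average of the visual-to-text and text-to-visual contrastive losses."""
    logits = cosine_logits(visual, textual, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy example: two visual embeddings and their (hypothetical) gloss embeddings.
visual_features = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(visual_features, aligned_text)
swapped_loss = symmetric_contrastive_loss(visual_features, swapped_text)
```

In this toy setting the loss is close to zero when each visual embedding matches its paired text embedding and large when the pairs are swapped, which is the behavior a contrastive alignment objective relies on. Under the same assumptions, the rows could represent either individual glosses or whole sentences.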

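Leveraging multi-modal data and the generalization ability of language models, this work proposes SignVTCL, a multi-modal continuous sign language recognition framework based on visual-textual contrastive learning. It integrates multi-modal data (video, keypoints, and optical flow) to train a unified visual backbone that yields more robust visual representations, and it uses a visual-textual alignment approach at the gloss and sentence levels to ensure precise correspondence between visual features and glosses. Experimental results show that SignVTCL achieves state-of-the-art results on three datasets compared with previous methods.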