Touch is crucial to the perception and interaction capabilities of both
humans and robots. Nevertheless, multimodal research on touch focuses
primarily on the visual and tactile modalities, with limited exploration of
language. Compared with individual words, sentence-level descriptions carry
richer semantics. Motivated by this, we construct a touch-language-vision
dataset named TLV (Touch-Language-Vision) through human-machine cascade
collaboration, featuring sentence-level descriptions for multimodal alignment.
We use the new dataset to fine-tune our proposed lightweight training
framework, TLV-Link (Linking Touch, Language, and Vision through Alignment),
achieving effective semantic alignment while adjusting only 1% of the
parameters. Project Page:
this https URL

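Through human-machine cascade collaboration, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision), which contains sentence-level descriptions for multimodal alignment. Using this new dataset with our proposed lightweight training framework, TLV-Link (Linking Touch, Language, and Vision through Alignment), we achieve effective semantic alignment with minimal parameter adjustments (1%).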
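
To make the idea of semantic alignment with a small trainable fraction concrete, the following is a minimal sketch rather than the TLV-Link implementation. The abstract does not state the alignment objective, so the sketch assumes a CLIP-style symmetric contrastive loss over paired touch and text embeddings; the function names, the temperature value, and the parameter counts in the usage note are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def symmetric_contrastive_loss(touch, text, temperature=0.07):
    """Assumed objective: symmetric InfoNCE over a batch where row i of
    `touch` is paired with row i of `text`. Not confirmed by the source."""
    t = [l2_normalize(v) for v in touch]
    s = [l2_normalize(v) for v in text]
    n = len(t)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(t[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean negative log-probability of the matching (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the touch-to-text and text-to-touch directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def trainable_fraction(param_counts, trainable_names):
    """Share of parameters that are updated during fine-tuning."""
    total = sum(param_counts.values())
    trainable = sum(param_counts[name] for name in trainable_names)
    return trainable / total
```

As a usage illustration with made-up numbers, `trainable_fraction({"touch_encoder": 99_000_000, "adapter": 1_000_000}, ["adapter"])` corresponds to the 1% regime mentioned in the abstract, and `symmetric_contrastive_loss` is lower for correctly paired embeddings than for shuffled ones.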