In this paper, we introduce a method for unifying language, action, and state
information in a shared embedding space to facilitate a range of downstream
tasks in robot learning. Our method, Contrastive Language, Action, and State
Pre-training (CLASP), extends the CLIP formulation by incorporating
distributional learning, capturing the inherent complexities and one-to-many
relationships in behaviour-text alignment. By employing distributional outputs
for both text and behaviour encoders, our model effectively associates diverse
textual commands with a single behaviour and vice versa. We demonstrate the
utility of our method for the following downstream tasks: zero-shot
text-behaviour retrieval, captioning unseen robot behaviours, and learning a
behaviour prior for language-conditioned reinforcement learning. Our
distributional encoders exhibit superior retrieval and captioning performance
on unseen datasets, and the ability to generate meaningful exploratory
behaviours from textual commands, capturing the intricate relationships between
language, action, and state. This work represents an initial step towards
developing a unified pre-trained model for robotics, with the potential to
generalise to a broad range of downstream tasks.
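
To make the idea of a CLIP-style objective with distributional encoders concrete, the following is a minimal sketch and not the authors' implementation. It assumes each encoder outputs a diagonal Gaussian (a mean and a log standard deviation), that embeddings are drawn with the reparameterisation trick, and that a symmetric cross-entropy loss is applied to cosine similarities. The function names, the Gaussian parameterisation, and the temperature value are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def sample_gaussian(mean, log_std, rng):
    """Draw one embedding from a diagonal Gaussian (reparameterisation trick).

    Assumption: the encoder output is parameterised by a mean and a log std.
    """
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def l2_normalise(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(text_emb, behaviour_emb, temperature=0.07):
    """CLIP-style loss: row i of each batch is treated as a matching pair.

    The temperature is an illustrative value, not one taken from the paper.
    """
    t = [l2_normalise(v) for v in text_emb]
    b = [l2_normalise(v) for v in behaviour_emb]
    n = len(t)
    # Cosine-similarity logits between every text and every behaviour.
    logits = [
        [sum(x * y for x, y in zip(t[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text-to-behaviour direction: softmax over each row.
    loss_t2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Behaviour-to-text direction: softmax over each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_b2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_t2b + loss_b2t)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical encoder outputs for a batch of two text-behaviour pairs.
    text_means = [[1.0, 0.0], [0.0, 1.0]]
    behaviour_means = [[0.9, 0.1], [0.1, 0.9]]
    log_std = [-2.0, -2.0]
    text_samples = [sample_gaussian(m, log_std, rng) for m in text_means]
    behaviour_samples = [sample_gaussian(m, log_std, rng) for m in behaviour_means]
    print(symmetric_contrastive_loss(text_samples, behaviour_samples))
```

Because each embedding is sampled and not fixed, repeated draws for one behaviour can land near several different text embeddings. This is one plausible way to represent the one-to-many relationships the abstract describes.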

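This paper introduces Contrastive Language, Action, and State Pre-training
(CLASP), a method that uses distributional encoder outputs to align textual
commands with robot behaviours more accurately, in support of downstream tasks
in robot learning. The model performs strongly on retrieval and behaviour
captioning on unseen datasets.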

Contrastive Language, Action, and State Pre-training for Robot Learning

We present lilGym, a new benchmark for language-conditioned reinforcement
learning in visual environments. lilGym is based on 2,661 highly-compositional
human-written natural language statements grounded in an interactive visual
environment. We introduce a new approach for exact reward computation in every
possible world state by annotating all statements with executable Python
programs. Each statement is paired with multiple start states and reward
functions to form thousands of distinct Markov Decision Processes of varying
difficulty. We experiment with lilGym with different models and learning
regimes. Our results and analysis show that while existing methods are able to
achieve non-trivial performance, lilGym forms a challenging open problem.
lilGym is available at this https URL.
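
The idea of annotating a statement with an executable Python program can be sketched as follows. This is an illustration under stated assumptions and not lilGym's actual API: the `Item` state representation, the example statement, the `make_reward_fn` helper, and the terminal +1/-1 reward values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    """Hypothetical object in the visual environment."""
    shape: str
    colour: str


State = List[Item]


def at_least_one_blue_square(state: State) -> bool:
    """Hypothetical program for the statement 'There is at least one blue square.'"""
    return any(item.shape == "square" and item.colour == "blue" for item in state)


def make_reward_fn(
    program: Callable[[State], bool], target: bool
) -> Callable[[State, bool], float]:
    """Build a reward function from a statement program and a target truth value.

    Assumption: reward is given only when the episode ends, +1.0 if the
    statement's truth value matches the target and -1.0 otherwise.
    """
    def reward(state: State, done: bool) -> float:
        if not done:
            return 0.0
        return 1.0 if program(state) == target else -1.0

    return reward


if __name__ == "__main__":
    make_true = make_reward_fn(at_least_one_blue_square, target=True)
    final_state = [Item("square", "blue"), Item("circle", "yellow")]
    print(make_true(final_state, done=True))
```

Since the program can be evaluated on any state, the reward is exact everywhere in the state space. Pairing the same statement with a different target truth value or start state gives a distinct Markov Decision Process.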

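lilGym is a benchmark for language-conditioned reinforcement learning in
visual environments. Annotating every statement with an executable Python
program allows exact reward computation in every possible world state and
yields thousands of Markov Decision Processes of varying difficulty.
Experiments with different models and learning regimes show that lilGym
remains a challenging open problem.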