Multimodal Large Language Models (MLLMs) have demonstrated a wide range of
capabilities across many domains, including Embodied AI. In this work, we study
how to best ground an MLLM in different embodiments and their associated
action spaces, with the goal of leveraging the multimodal world knowledge of
the MLLM. We first generalize a number of methods through a unified
architecture and the lens of action space adapters. For continuous actions, we
show that a learned tokenization allows for sufficient modeling precision,
yielding the best performance on downstream tasks. For discrete actions, we
demonstrate that semantically aligning these actions with the native output
token space of the MLLM leads to the strongest performance. We arrive at these
lessons via a thorough study of seven action space adapters on five different
environments, encompassing over 114 embodied tasks.
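
As a rough illustration of what a learned tokenization of continuous actions can look like, the sketch below fits a small codebook to example actions and maps each action to the index of its nearest code. The abstract does not say which tokenization the authors use; vector quantization with a k-means-style fit is an assumption made here, and the names `fit_codebook`, `tokenize`, and `detokenize` are hypothetical.

```python
import numpy as np

def fit_codebook(actions, num_tokens, iterations=20):
    """Fit a codebook with plain k-means (an assumed stand-in for the learned tokenizer)."""
    # Deterministic initialization: use the first num_tokens actions as initial codes.
    codebook = actions[:num_tokens].astype(float).copy()
    for _ in range(iterations):
        # Distance from every action to every code, shape (num_actions, num_tokens).
        dists = np.linalg.norm(actions[:, None, :] - codebook[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        for k in range(num_tokens):
            members = actions[assignments == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def tokenize(action, codebook):
    """Map a continuous action vector to the index (token) of its nearest code."""
    return int(np.linalg.norm(codebook - action, axis=1).argmin())

def detokenize(token, codebook):
    """Recover the continuous action that a token stands for."""
    return codebook[token]

# Hypothetical 2-D actions clustered around two modes.
example_actions = np.array([
    [0.0, 0.0], [1.0, 1.0], [0.1, 0.0],
    [0.9, 1.0], [0.0, 0.1], [1.0, 0.9],
])
codebook = fit_codebook(example_actions, num_tokens=2)
token = tokenize(np.array([0.8, 0.95]), codebook)
reconstructed = detokenize(token, codebook)
```

The gap between an action and its reconstruction is the precision lost to tokenization. Under this assumption, a larger or better-fitted codebook narrows that gap, which is the sense in which a learned tokenization can offer sufficient modeling precision.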

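By studying action space adapters within a unified architecture, we identify how multimodal large language models achieve their best performance on both continuous and discrete actions.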

Grounding Multimodal Large Language Models in Actions

Being able to reason in an environment with a large number of discrete
actions is essential to bringing reinforcement learning to a larger class of
problems. Recommender systems, industrial plants and language models are only
some of the many real-world settings involving large numbers of discrete
actions to which current methods are difficult or even impossible to apply. An
ability to generalize over the set of actions as well as sub-linear complexity
relative to the size of the set are both necessary to handle such tasks.
Current approaches are not able to provide both of these, which motivates the
work in this paper. Our proposed approach leverages prior information about the
actions to embed them in a continuous space upon which it can generalize.
Additionally, approximate nearest-neighbor methods allow for logarithmic-time
lookup complexity relative to the number of actions, which is necessary for
training in tractable time. This combined approach allows reinforcement
learning methods to be applied to large-scale learning problems that were
previously intractable. We demonstrate our algorithm's abilities on a series
of tasks having up to one million actions.
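
The lookup described above can be sketched as follows: a policy proposes a point in the continuous embedding space, and the discrete actions whose embeddings lie closest to that point become the candidates. This is a minimal sketch under stated assumptions, not the paper's implementation. It uses exact brute-force search, which is linear in the number of actions, in place of the approximate nearest-neighbor index that would give logarithmic-time lookup. The re-ranking of candidates by a value function is an added assumption, since the abstract does not describe how the final action is chosen. The names `nearest_actions`, `select_action`, and `value_fn` are hypothetical.

```python
import numpy as np

def nearest_actions(proto_action, action_embeddings, k):
    """Return indices of the k actions whose embeddings are closest to proto_action.

    Exact search for clarity; an approximate index would replace this at scale.
    """
    distances = np.linalg.norm(action_embeddings - proto_action, axis=1)
    return np.argsort(distances)[:k]

def select_action(proto_action, action_embeddings, value_fn, k=3):
    """Pick, among the k nearest actions, the one that value_fn scores highest.

    value_fn is an assumed scoring function over action embeddings.
    """
    candidates = nearest_actions(proto_action, action_embeddings, k)
    values = np.array([value_fn(action_embeddings[i]) for i in candidates])
    return int(candidates[np.argmax(values)])

# Hypothetical embeddings for five discrete actions.
action_embeddings = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0],
])
proto_action = np.array([0.9, 0.8])

candidates = nearest_actions(proto_action, action_embeddings, k=3)
chosen = select_action(proto_action, action_embeddings, lambda e: -e[0], k=3)
```

Because nearby embeddings are treated as similar actions, experience gathered with one action informs its neighbors, which is how an embedding of this kind supports generalization over the action set.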

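This paper proposes a reinforcement learning algorithm that uses prior information about the actions to embed a large number of discrete actions in a continuous space and relies on approximate nearest-neighbor methods for lookup, making large-scale learning problems tractable.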