Recently, the astonishing performance of large language models (LLMs) in
natural language comprehension and generation tasks has triggered extensive
exploration of using them as central controllers to build agent systems.
Multiple studies focus on connecting LLMs to external tools to extend their
application scenarios. However, current LLMs perceive tool-use requests only
through a single text query, which may leave the user's real intention
ambiguous. LLMs are expected to resolve this ambiguity by perceiving the
information in visually or auditorily grounded instructions.
Therefore, in this paper, we propose Tool-LMM, a system incorporating
open-source LLMs and multi-modal encoders so that the trained LLMs are aware
of multi-modal input instructions and can then select the function-matched
tool correctly. To facilitate the evaluation of the model's capability, we
collect a dataset of multi-modal input tools from HuggingFace. Another
important feature of the dataset is that it contains multiple potential
choices for the same instruction, owing to the existence of identical and
synonymous functions, which provides more potential solutions for the same
query. The experiments reveal that our LMM is capable of recommending
appropriate tools for multi-modal instructions. Code and data are available
at this https URL
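
The dataset property described in the abstract above, several acceptable tools
per instruction, changes how a recommendation is scored. The following is a
minimal sketch, assuming a prediction counts as correct when it matches any
acceptable tool. The `Example` class, the `tool_accuracy` function, the
modality labels, and the tool names are hypothetical and are not taken from
the Tool-LMM code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One benchmark item: an instruction plus every tool accepted as correct."""
    instruction: str
    modality: str                 # assumed labels such as "text", "image", "audio"
    acceptable_tools: frozenset   # tools with identical or synonymous functions


def tool_accuracy(examples, predictions):
    """Fraction of examples whose predicted tool is among the acceptable tools."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        return 0.0
    hits = sum(pred in ex.acceptable_tools for ex, pred in zip(examples, predictions))
    return hits / len(examples)


# Hypothetical data: the tool names are placeholders, not real dataset entries.
EXAMPLES = [
    Example("Describe this picture.", "image", frozenset({"captioner-a", "captioner-b"})),
    Example("Transcribe this recording.", "audio", frozenset({"asr-a"})),
]
PREDICTIONS = ["captioner-b", "tts-a"]  # first is acceptable, second is not
ACCURACY = tool_accuracy(EXAMPLES, PREDICTIONS)
```

Scoring against a set rather than a single label avoids penalizing the model
for choosing a synonymous tool, which appears to be the point of the
multiple-choice design.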

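By using multi-modal encoders to combine open-source large language models (LLMs) with multi-modal input instructions, we propose the Tool-LMM system, which makes the trained LLMs aware of multi-modal input instructions so that they correctly select the function-matched tool; experiments show that our LMM can recommend appropriate tools for multi-modal instructions.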

Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

Large language models have shown their remarkable capabilities as a general
interface for various language-related applications. Motivated by this, we
aim to build a unified interface for completing many vision-language tasks
including image description, visual question answering, and visual grounding,
among others. The challenge is to use a single model for performing diverse
vision-language tasks effectively with simple multi-modal instructions. Towards
this objective, we introduce MiniGPT-v2, a model that can be treated as a
unified interface for better handling various vision-language tasks. We propose
using unique identifiers for different tasks when training the model. These
identifiers enable our model to distinguish each task instruction easily
and also improve the model's learning efficiency for each task.
After the three-stage training, the experimental results show that MiniGPT-v2
achieves strong performance on many visual question-answering and visual
grounding benchmarks compared to other vision-language generalist models. Our
model and code are available at this https URL
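
The task-identifier idea in the abstract above can be illustrated with a
small prompt builder. This is only a sketch under stated assumptions: the
abstract does not give the identifier strings or the prompt template, so the
tokens in `TASK_IDENTIFIERS`, the `IMAGE_PLACEHOLDER` string, and the
`build_prompt` function are all hypothetical.

```python
# Illustrative identifier strings; the abstract does not list the real tokens.
TASK_IDENTIFIERS = {
    "image_description": "[caption]",
    "visual_question_answering": "[vqa]",
    "visual_grounding": "[grounding]",
}

IMAGE_PLACEHOLDER = "<image>"  # assumed stand-in for the encoded image features


def build_prompt(task, instruction):
    """Prefix the instruction with its task identifier so tasks stay distinguishable."""
    try:
        identifier = TASK_IDENTIFIERS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    return f"{IMAGE_PLACEHOLDER} {identifier} {instruction.strip()}"


PROMPT = build_prompt("visual_question_answering", "What color is the car?")
```

With a fixed identifier in front of every instruction, two instructions with
similar wording but different intended outputs (an answer versus a bounding
box, for example) no longer look alike to the model.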

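We use MiniGPT-v2 to build a unified interface that effectively handles various vision-language tasks, including image description, visual question answering, and visual grounding, and we use unique identifiers to improve the model's learning efficiency on each task.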

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

To realize human-robot collaboration, robots need to execute actions for new
tasks according to human instructions given finite prior knowledge. Human
experts can share their knowledge of how to perform a task with a robot through
multi-modal instructions in their demonstrations, showing a sequence of
short-horizon steps to achieve a long-horizon goal. This paper introduces a
method for robot action sequence generation from instruction videos using (1)
an audio-visual Transformer that converts audio-visual features and instruction
speech to a sequence of robot actions called dynamic movement primitives (DMPs)
and (2) style-transfer-based training that employs multi-task learning with
video captioning and weakly-supervised learning with a semantic classifier to
exploit unpaired video-action data. We built a system that accomplishes various
cooking actions, where a robot arm executes a DMP sequence acquired from a
cooking video using the audio-visual Transformer. Experiments with
Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets
show that the proposed method improves the quality of DMP sequences by a
factor of 2.3 in METEOR score over a baseline video-to-action Transformer.
The model achieved a 32% task success rate given task knowledge of the
object.
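
For readers unfamiliar with dynamic movement primitives, the sketch below
integrates a single one-dimensional DMP in its common textbook form: a
critically damped spring-damper pulled toward a goal, plus a learned forcing
term that fades out through a canonical phase variable. The abstract does not
say how the authors parameterize their DMPs, so the `rollout_dmp` function,
its gains, and the basis-width heuristic are assumptions for illustration
only.

```python
import math


def rollout_dmp(y0, goal, weights, duration=1.0, dt=0.001,
                alpha_z=25.0, alpha_x=4.0):
    """Integrate a 1-D discrete DMP with explicit Euler steps; return positions."""
    beta_z = alpha_z / 4.0  # makes the spring-damper critically damped
    n = len(weights)
    # Basis centres spaced evenly in time, mapped through the canonical system.
    centres = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c / alpha_x for c in centres]  # assumed width heuristic

    y, z, x = y0, 0.0, 1.0  # position, scaled velocity, phase
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for c, h in zip(centres, widths)]
        total = sum(psi)
        forcing = 0.0
        if n and total > 1e-12:
            weighted = sum(p * w for p, w in zip(psi, weights))
            forcing = weighted / total * x * (goal - y0)
        # Compute all derivatives from the current state before updating it.
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / duration
        dy = z / duration
        dx = -alpha_x * x / duration
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory


# With zero weights the forcing term vanishes and the motion settles at the goal.
TRAJECTORY = rollout_dmp(0.0, 1.0, weights=[0.0] * 5)
```

In this formulation the weights shape the path while the goal attractor
guarantees where it ends, which is one reason a sequence of such primitives
is a convenient target representation for a video-to-action model.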

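This paper introduces a method for generating robot action sequences from instruction videos to enable human-robot collaboration, and shows that the method achieves a 32% success rate across various cooking actions.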

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

Foundation models have made significant strides in various applications,
including text-to-image generation, panoptic segmentation, and natural language
processing. This paper presents Instruct2Act, a framework that utilizes Large
Language Models to map multi-modal instructions to sequential actions for
robotic manipulation tasks. Specifically, Instruct2Act employs the LLM to
generate Python programs that constitute a comprehensive perception, planning,
and action loop for robotic tasks. In the perception section, pre-defined APIs
are used to access multiple foundation models where the Segment Anything Model
(SAM) accurately locates candidate objects, and CLIP classifies them. In this
way, the framework leverages the expertise of foundation models and robotic
abilities to convert complex high-level instructions into precise policy code.
Our approach is adjustable and flexible in accommodating various instruction
modalities and input types and catering to specific task demands. We validated
the practicality and efficiency of our approach by assessing it on robotic
tasks in different scenarios within tabletop manipulation domains. Furthermore,
our zero-shot method outperformed many state-of-the-art learning-based policies
in several tasks. The code for our proposed approach is available at
this https URL, serving as a robust benchmark for
high-level robotic instruction tasks with assorted modality inputs.
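
The perception, planning, and action loop described in the abstract above
might look roughly like the following when written as a short Python program.
This is a toy sketch, not Instruct2Act's API: `segment_candidates` and
`classify` are hypothetical stand-ins for the SAM-backed and CLIP-backed
calls, they operate on a dictionary instead of real images, and the
pick-and-place action format is assumed.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate object region; `label_hint` stands in for image content."""
    center: tuple
    label_hint: str


def segment_candidates(scene):
    """Stand-in for a SAM-backed API: return candidate object regions."""
    return [Candidate(center=pos, label_hint=name) for name, pos in scene.items()]


def classify(candidate, labels):
    """Stand-in for a CLIP-backed API: pick the matching label, if any."""
    return candidate.label_hint if candidate.label_hint in labels else None


def locate(scene, target, labels):
    """Perception: segment the scene, classify each region, return the target position."""
    for candidate in segment_candidates(scene):
        if classify(candidate, labels) == target:
            return candidate.center
    raise LookupError(f"{target!r} not found in scene")


def run_policy(scene, pick, place, labels):
    """Planning and action: the kind of short program an LLM might generate."""
    source = locate(scene, pick, labels)
    destination = locate(scene, place, labels)
    return [("pick", source), ("place", destination)]


SCENE = {"red block": (0.1, 0.2), "green bowl": (0.4, 0.3)}  # toy scene, not images
ACTIONS = run_policy(SCENE, "red block", "green bowl", labels=list(SCENE))
```

Keeping perception behind a couple of narrow functions is what lets the
generated program stay short: the LLM only has to compose calls, while the
foundation models do the visual work.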

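This paper introduces the Instruct2Act framework, which uses large language models to map multi-modal instructions to the Python code needed for robotic manipulation tasks, and employs foundation models such as the Segment Anything Model (SAM) and CLIP to locate and classify objects effectively, yielding efficient robotic manipulation policies.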