Multi-modal large language models have demonstrated impressive performance
on most vision-language tasks. However, these models generally lack the
ability to understand domain-specific data, particularly when it comes
to interpreting chart figures. This is mainly due to the lack of relevant
multi-modal instruction tuning datasets. In this article, we create a
high-quality instruction-tuning dataset leveraging GPT-4. We develop a
multi-step data generation process in which different steps are responsible for
generating tabular data, creating chart figures, and designing instruction
tuning data separately. Our method's flexibility enables us to generate
diverse, high-quality instruction-tuning data consistently and efficiently
while maintaining a low resource expenditure. Additionally, it allows us to
incorporate a wider variety of chart and task types not yet featured in
existing datasets. Next, we introduce ChartLlama, a multi-modal large language
model trained on our generated dataset. ChartLlama outperforms all
prior methods on the ChartQA, Chart-to-text, and Chart-extraction evaluation
benchmarks. Additionally, ChartLlama significantly improves upon the baseline
on our specially compiled chart dataset, which includes new chart and task
types. The results of ChartLlama confirm the value and huge potential of our
proposed data generation method in enhancing chart comprehension.
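
The three-stage generation process described above (tabular data, then a chart figure, then instruction-tuning pairs) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract says GPT-4 is leveraged, whereas every function here (`generate_table`, `make_chart_spec`, `make_instructions`, `build_sample`) is a hypothetical local stand-in, the data is random, and the "chart" is a plain specification dictionary rather than a rendered image.

```python
import json
import random


def generate_table(topic, n_rows, seed=0):
    """Stage 1 stand-in: produce tabular data (random values, not GPT-4 output)."""
    rng = random.Random(seed)
    return {
        "title": topic,
        "columns": ["category", "value"],
        "rows": [[f"item_{i}", rng.randint(1, 100)] for i in range(n_rows)],
    }


def make_chart_spec(table, chart_type="bar"):
    """Stage 2 stand-in: describe the figure that would be drawn from the table."""
    return {
        "type": chart_type,
        "title": table["title"],
        "x": [row[0] for row in table["rows"]],
        "y": [row[1] for row in table["rows"]],
    }


def make_instructions(table, spec):
    """Stage 3 stand-in: build instruction/answer pairs grounded in the table."""
    top_row = max(table["rows"], key=lambda row: row[1])
    return [
        {
            "instruction": f"Which category has the highest value in the {spec['type']} chart?",
            "answer": top_row[0],
        },
        {
            "instruction": "Extract the underlying data table as JSON.",
            "answer": json.dumps(table["rows"]),
        },
    ]


def build_sample(topic, n_rows=4, seed=0):
    """Run the three stages in order and bundle their outputs into one sample."""
    table = generate_table(topic, n_rows, seed)
    spec = make_chart_spec(table)
    return {"table": table, "chart": spec, "instructions": make_instructions(table, spec)}
```

Because each stage only consumes the previous stage's output, a new chart type or task type can in principle be added by changing one stage, which is the flexibility the abstract points to.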

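By creating a high-quality instruction-tuning dataset and using it to train the multi-modal large language model ChartLlama, the data generation method proposed in this work effectively improves chart comprehension and clearly outperforms prior methods on the ChartQA, Chart-to-text, and Chart-extraction evaluations, confirming its considerable potential.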

ChartLlama: A Multimodal LLM for Chart Understanding and Generation

Despite the advancements of open-source large language models (LLMs) and
their variants, e.g., LLaMA and Vicuna, they remain significantly limited in
performing higher-level tasks, such as following human instructions to use
external tools (APIs). This is because current instruction tuning largely
focuses on basic language tasks instead of the tool-use domain. This is in
contrast to state-of-the-art (SOTA) LLMs, e.g., ChatGPT, which have
demonstrated excellent tool-use capabilities but are unfortunately closed
source. To facilitate tool-use capabilities within open-source LLMs, we
introduce ToolLLM, a general tool-use framework of data construction, model
training and evaluation. We first present ToolBench, an instruction-tuning
dataset for tool use, which is created automatically using ChatGPT.
Specifically, we collect 16,464 real-world RESTful APIs spanning 49 categories
from RapidAPI Hub, then prompt ChatGPT to generate diverse human instructions
involving these APIs, covering both single-tool and multi-tool scenarios.
Finally, we use ChatGPT to search for a valid solution path (chain of API
calls) for each instruction. To make the searching process more efficient, we
develop a novel depth-first search-based decision tree (DFSDT), enabling LLMs
to evaluate multiple reasoning traces and expand the search space. We show that
DFSDT significantly enhances the planning and reasoning capabilities of LLMs.
For efficient tool-use assessment, we develop an automatic evaluator: ToolEval.
We fine-tune LLaMA on ToolBench and obtain ToolLLaMA. Our ToolEval reveals that
ToolLLaMA demonstrates a remarkable ability to execute complex instructions and
generalize to unseen APIs, and exhibits comparable performance to ChatGPT. To
make the pipeline more practical, we devise a neural API retriever to recommend
appropriate APIs for each instruction, negating the need for manual API
selection.
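
The idea of searching for a valid chain of API calls depth-first, backing out of dead ends and trying alternative branches, can be sketched as below. This is a generic depth-first search with backtracking under stated assumptions, not the authors' DFSDT implementation: `propose` and `is_solution` are hypothetical callbacks standing in for the LLM that suggests next API calls and judges whether a path solves the instruction, and the toy API names are invented for illustration.

```python
from typing import Callable, List, Optional, Sequence


def dfs_solution_path(
    propose: Callable[[Sequence[str]], List[str]],
    is_solution: Callable[[Sequence[str]], bool],
    max_depth: int = 4,
) -> Optional[List[str]]:
    """Return the first chain of API calls accepted by is_solution, or None."""

    def visit(path: List[str]) -> Optional[List[str]]:
        if is_solution(path):
            return list(path)
        if len(path) >= max_depth:
            return None  # depth budget exhausted on this branch
        for call in propose(path):
            path.append(call)
            found = visit(path)
            if found is not None:
                return found
            path.pop()  # backtrack and try the next candidate call
        return None

    return visit([])


# Toy stand-ins for the model: hypothetical API names, fixed candidate lists.
def toy_propose(path: Sequence[str]) -> List[str]:
    options = {
        (): ["get_weather", "search_city"],
        ("search_city",): ["get_weather"],
    }
    return options.get(tuple(path), [])


def toy_is_solution(path: Sequence[str]) -> bool:
    return list(path) == ["search_city", "get_weather"]


result = dfs_solution_path(toy_propose, toy_is_solution)
```

In the toy run, the search first tries `get_weather` alone, finds no way forward, backtracks, and then succeeds through `search_city`. A single left-to-right reasoning trace would have stopped at the first dead end, which is the limitation that exploring multiple traces is meant to address.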

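By introducing ToolLLM, a general tool-use framework covering data construction, model training, and evaluation, we show a significant improvement in the planning and reasoning capabilities of language models. We use ChatGPT to build ToolBench, an instruction-tuning dataset for tool use, and apply a depth-first search-based decision tree (DFSDT) to expand the search space and find valid solution paths efficiently. Fine-tuning LLaMA on ToolBench yields ToolLLaMA; our evaluator ToolEval shows that ToolLLaMA is remarkably capable of executing complex instructions and generalizing to unseen APIs, with performance comparable to ChatGPT. To make the pipeline more practical, we devise a neural API retriever that recommends appropriate APIs for each instruction, removing the need for manual API selection.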

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Large Language Models (LLMs) have shown enhanced capabilities for solving
novel tasks by reasoning step by step, known as Chain-of-Thought (CoT)
reasoning. How can we instill the same step-by-step reasoning capability on
unseen tasks into LMs with fewer than 100B parameters? To address this
question, we first introduce the CoT Collection, a new instruction-tuning
dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show
that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables
the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an
improvement in the average zero-shot accuracy on 27 datasets of the
BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we
show that instruction tuning with CoT allows LMs to possess stronger few-shot
learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4
domain-specific tasks over Flan-T5 (3B & 11B), respectively. We make our CoT
Collection data and our trained models publicly available.

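By fine-tuning Flan-T5-based language models on the CoT Collection, the models learn Chain-of-Thought reasoning and show stronger few-shot learning capabilities, improving average zero-shot accuracy on 27 datasets by 4.34% and 2.44% and yielding further gains on 4 domain-specific tasks.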