Large language model agents have exhibited exceptional performance across a
range of complex interactive tasks. Recent approaches have utilized tuning with
expert trajectories to enhance agent performance, yet they primarily
concentrate on outcome rewards, which may lead to errors or suboptimal actions
due to the absence of process supervision signals. In this paper, we introduce
the Iterative step-level Process Refinement (IPR) framework, which provides
detailed step-by-step guidance to enhance agent training. Specifically, we
adopt the Monte Carlo method to estimate step-level rewards. During each
iteration, the agent explores along the expert trajectory and generates new
actions. These actions are then evaluated against the corresponding step of
the expert trajectory using step-level rewards. Such comparison helps identify
discrepancies, yielding contrastive action pairs that serve as training data
for the agent. Our experiments on three complex agent tasks demonstrate that
our framework outperforms a variety of strong baselines. Moreover, our
analytical findings highlight the effectiveness of IPR in augmenting action
efficiency and its applicability to diverse models.
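
The Monte Carlo estimate and the construction of contrastive action pairs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy state (a tuple of past actions), the helper names `mc_step_reward` and `build_contrastive_pairs`, the `margin` threshold, and the toy task are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Action = int
State = Tuple[Action, ...]  # toy state (assumption): the actions taken so far
Policy = Callable[[State, random.Random], Action]


def mc_step_reward(state: State, rollout_policy: Policy,
                   outcome_reward: Callable[[State], float],
                   horizon: int, n_samples: int, rng: random.Random) -> float:
    """Estimate a step-level reward as the mean outcome reward of
    n_samples rollouts that start from `state`."""
    total = 0.0
    for _ in range(n_samples):
        traj = list(state)
        while len(traj) < horizon:
            traj.append(rollout_policy(tuple(traj), rng))
        total += outcome_reward(tuple(traj))
    return total / n_samples


def build_contrastive_pairs(expert: Sequence[Action], agent_policy: Policy,
                            rollout_policy: Policy,
                            outcome_reward: Callable[[State], float],
                            n_samples: int = 8, margin: float = 0.0,
                            seed: int = 0) -> List[Tuple[State, Action, Action]]:
    """Explore along the expert trajectory and return
    (prefix, preferred_action, rejected_action) triples."""
    rng = random.Random(seed)
    horizon = len(expert)
    pairs = []
    for t in range(horizon):
        prefix = tuple(expert[:t])
        agent_action = agent_policy(prefix, rng)
        if agent_action == expert[t]:
            continue  # no discrepancy at this step
        r_expert = mc_step_reward(prefix + (expert[t],), rollout_policy,
                                  outcome_reward, horizon, n_samples, rng)
        r_agent = mc_step_reward(prefix + (agent_action,), rollout_policy,
                                 outcome_reward, horizon, n_samples, rng)
        if r_expert - r_agent > margin:
            pairs.append((prefix, expert[t], agent_action))
    return pairs


# Toy task (assumption): the episode succeeds only if every action is 1.
expert = [1, 1, 1]

def outcome_reward(traj: State) -> float:
    return 1.0 if sum(traj) == 3 else 0.0

def rollout_policy(state: State, rng: random.Random) -> Action:
    return 1  # deterministic rollouts keep the example reproducible

def agent_policy(state: State, rng: random.Random) -> Action:
    return 0 if len(state) == 1 else 1  # the agent errs at the second step

pairs = build_contrastive_pairs(expert, agent_policy, rollout_policy, outcome_reward)
print(pairs)  # [((1,), 1, 0)]
```

In this toy run the agent deviates only at the second step, so a single pair is produced, with the expert action preferred over the agent's action. With a stochastic rollout policy, `n_samples` would trade estimation variance against rollout cost.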

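The Iterative step-level Process Refinement (IPR) framework uses the Monte Carlo method to estimate step-level rewards. The agent's actions are evaluated against the expert trajectory to identify discrepancies, yielding contrastive action pairs that are used to train the model. Experiments show that the framework outperforms other baselines and improves action efficiency.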

Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

Open-source pre-trained Large Language Models (LLMs) exhibit strong language
understanding and generation capabilities, making them highly successful in a
variety of tasks. However, when used as agents for dealing with complex
problems in the real world, their performance is far inferior to large
commercial models such as ChatGPT and GPT-4. To perform well as intelligent
agents, LLMs need task planning, long-term memory, and the ability to leverage
external tools. Various methods have been proposed to enhance the agent
capabilities of LLMs. On the one hand, some methods construct agent-specific
data and fine-tune the models. On the other hand, some methods focus on
designing prompts that effectively activate the reasoning abilities of the
LLMs. We explore both strategies on the
7B and 13B models. We propose a comprehensive method for constructing
agent-specific data using GPT-4. Through supervised fine-tuning with
constructed data, we find that for these models with a relatively small number
of parameters, supervised fine-tuning can significantly reduce hallucination
outputs and formatting errors in agent tasks. Furthermore, techniques such as
multi-path reasoning and task decomposition can effectively decrease problem
complexity and enhance the performance of LLMs as agents. We evaluate our
method on five agent tasks of AgentBench and achieve satisfactory results.

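By constructing agent-specific data and applying supervised fine-tuning, and by designing prompting methods that effectively activate the reasoning abilities of large language models, we propose a comprehensive method to improve the performance of LLMs as agents. Evaluation on five agent tasks of AgentBench yields satisfactory results.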

Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning

Open large language models (LLMs) with great performance in various tasks
have significantly advanced the development of LLMs. However, they are far
inferior to commercial models such as ChatGPT and GPT-4 when acting as agents
to tackle complex tasks in the real world. These agent tasks employ LLMs as the
central controller responsible for planning, memorization, and tool
utilization, necessitating both fine-grained prompting methods and robust LLMs
to achieve satisfactory performance. Though many prompting methods have been
proposed to complete particular agent tasks, there is a lack of research focusing
on improving the agent capabilities of LLMs themselves without compromising
their general abilities. In this work, we present AgentTuning, a simple and
general method to enhance the agent abilities of LLMs while maintaining their
general LLM capabilities. We construct AgentInstruct, a lightweight
instruction-tuning dataset containing high-quality interaction trajectories. We
employ a hybrid instruction-tuning strategy by combining AgentInstruct with
open-source instructions from general domains. AgentTuning is used to
instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show
that AgentTuning enables LLMs' agent capabilities without compromising general
abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent
tasks, demonstrating generalized agent capabilities. We open-source
AgentInstruct and the AgentLM-7B, 13B, and 70B models at
this https URL, serving as open and powerful alternatives
to commercial LLMs for agent tasks.
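
The hybrid instruction-tuning strategy, which combines agent trajectories with general-domain instructions, could be approximated by sampling each training example from one of the two sources. The sketch below is illustrative only: the abstract does not specify how the two sources are mixed, so the `agent_ratio` parameter, the function name `mix_instruction_data`, and the placeholder examples are assumptions.

```python
import random
from typing import List, Sequence


def mix_instruction_data(agent_data: Sequence[str], general_data: Sequence[str],
                         agent_ratio: float, n_examples: int,
                         seed: int = 0) -> List[str]:
    """Draw n_examples, each from agent_data with probability agent_ratio
    and from general_data otherwise (sampling with replacement)."""
    if not 0.0 <= agent_ratio <= 1.0:
        raise ValueError("agent_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_examples):
        source = agent_data if rng.random() < agent_ratio else general_data
        mixed.append(rng.choice(source))
    return mixed


# Placeholder examples (assumption): real data would be full instruction records.
agent_examples = ["agent-trajectory-1", "agent-trajectory-2"]
general_examples = ["general-instruction-1", "general-instruction-2"]

batch = mix_instruction_data(agent_examples, general_examples,
                             agent_ratio=0.2, n_examples=10)
```

A lower `agent_ratio` keeps more general-domain data in the mix, which matches the stated goal of improving agent abilities without compromising general ones. The right value would have to be tuned empirically.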

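AgentTuning is a simple and general method that improves the agent capabilities of large language models while maintaining their general abilities. It instruction-tunes the Llama 2 series with a hybrid strategy that combines AgentInstruct with open-source instructions from general domains, resulting in AgentLM. Evaluations show that AgentTuning improves agent capabilities without affecting general abilities, and that AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. AgentInstruct and the AgentLM-7B, 13B, and 70B models are open-sourced at the specified URL, providing open and powerful alternatives for agent tasks.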