Training models to act as agents that can effectively navigate and perform
actions in a complex environment, such as a web browser, has typically been
challenging due to a lack of training data. Large language models (LLMs) have
recently demonstrated some capability to navigate novel environments as agents
in a zero-shot or few-shot fashion, purely guided by natural language
instructions as prompts. Recent research has also demonstrated that LLMs can
exceed their base performance through self-improvement, i.e., by
fine-tuning on data generated by the model itself. In this work, we explore the
extent to which LLMs can self-improve their performance as agents in
long-horizon tasks in a complex environment using the WebArena benchmark. In
WebArena, an agent must autonomously navigate and perform actions on web pages
to achieve a specified objective. We explore fine-tuning on three distinct
synthetic training data mixtures and achieve a 31\% improvement in task
completion rate over the base model on the WebArena benchmark through a
self-improvement procedure. We additionally contribute novel evaluation metrics
that assess the performance, robustness, capabilities, and trajectory quality
of our fine-tuned agent models in greater detail than the simple,
aggregate-level benchmark scores currently used to measure self-improvement.

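Summary: Using the WebArena benchmark, we explore the extent to which large language models can self-improve their performance as agents in long-horizon tasks in a complex environment. By fine-tuning on three distinct synthetic training data mixtures through a self-improvement procedure, we achieve a 31\% improvement in task completion rate on WebArena, and we additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and trajectory quality of our fine-tuned agent models.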