In web agent research, achieving both generalization and accuracy remains a
challenging problem. Because website structure varies widely, existing
approaches often fail, and existing fine-tuning and in-context learning
techniques do not generalize across multiple websites. We
introduce Wilbur, an approach that uses a differentiable ranking model and a
novel instruction synthesis technique to optimally populate a black-box large
language model's prompt with task demonstrations from previous runs. To
maximize end-to-end success rates, we also propose an intelligent backtracking
mechanism that learns and recovers from its mistakes. Finally, we show that our
ranking model can be trained on data from a generative auto-curriculum which
samples representative goals from an LLM, runs the agent, and automatically
evaluates it, with no manual annotation. Wilbur achieves state-of-the-art
results on the WebVoyager benchmark, beating text-only models by 8% overall,
and up to 36% on certain websites. On the same benchmark, Wilbur is within 5%
of a strong multi-modal model despite only receiving textual inputs, and
further analysis reveals a substantial number of failures are due to
engineering challenges of operating the web.
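
The idea of ranking demonstrations from previous runs and placing the best ones in the prompt can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Demo` record, the function names, and especially `overlap_score`, a simple word-overlap heuristic standing in for the learned, differentiable ranking model described in the abstract. It is not Wilbur's implementation.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """Hypothetical record of one previous agent run."""
    goal: str   # natural-language goal of the run
    trace: str  # textual record of the actions taken


def overlap_score(goal: str, demo: Demo) -> float:
    """Hypothetical relevance score: Jaccard overlap of goal words.

    A stand-in for a learned ranking model, used only for illustration.
    """
    a, b = set(goal.lower().split()), set(demo.goal.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def select_demos(goal: str, demos: list[Demo], k: int = 2) -> list[Demo]:
    """Return the k highest-scoring demonstrations for the goal."""
    ranked = sorted(demos, key=lambda d: overlap_score(goal, d), reverse=True)
    return ranked[:k]


def build_prompt(goal: str, demos: list[Demo], k: int = 2) -> str:
    """Place the selected demonstrations ahead of the new goal."""
    parts = [
        f"Previous goal: {d.goal}\nActions: {d.trace}"
        for d in select_demos(goal, demos, k)
    ]
    parts.append(f"Current goal: {goal}")
    return "\n\n".join(parts)


demos = [
    Demo("book a flight to Paris", "click search"),
    Demo("find a pasta recipe", "type pasta"),
    Demo("book a hotel in Paris", "click hotels"),
]
prompt = build_prompt("book a flight to Rome", demos, k=2)
```

In this toy setup the flight and hotel runs are selected and the recipe run is left out. A learned ranker would replace `overlap_score` while the selection and prompt assembly stay the same.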

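Wilbur uses a differentiable ranking model and a novel instruction synthesis technique to optimize a black-box large language model's prompt with task demonstrations drawn from previous runs, with the aim of maximizing end-to-end success rates, and it provides an intelligent backtracking mechanism that learns and recovers from mistakes. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall and by up to 36% on certain websites. Despite receiving only textual inputs, Wilbur is within 5% of a strong multi-modal model on the same benchmark, and further analysis shows that many failures are due to engineering challenges of operating the web.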

WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

We present a novel approach to automatically synthesize "wayfinding
instructions" for an embodied robot agent. In contrast to prior approaches that
are heavily reliant on human-annotated datasets designed exclusively for
specific simulation platforms, our algorithm uses in-context learning to
condition an LLM to generate instructions using just a few references. Using an
LLM-based Visual Question Answering strategy, we gather detailed information
about the environment which is used by the LLM for instruction synthesis. We
implement our approach on multiple simulation platforms including Matterport3D,
AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature.
We subjectively evaluate our approach via a user study and observe that 83.3%
of users find the synthesized instructions accurately capture the details of
the environment and show characteristics similar to those of human-generated
instructions. Further, we conduct zero-shot navigation with multiple approaches
on the REVERIE dataset using the generated instructions, and observe very close
correlation with the baseline on standard success metrics (< 1% change in SR),
quantifying the viability of generated instructions in replacing
human-annotated data. To the best of our knowledge, ours is the first
LLM-driven approach capable of generating "human-like" instructions in a
platform-agnostic manner, without requiring any form of training.
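
As a rough illustration of conditioning an LLM on a few reference instructions plus question-answer information about the environment, the sketch below assembles such a prompt as plain text. The function name `synthesis_prompt`, the prompt wording, and the example inputs are all hypothetical; the abstract does not specify the prompt format, and no LLM or VQA model is called here.

```python
def synthesis_prompt(references, scene_facts):
    """Assemble a few-shot prompt from reference instructions and VQA answers.

    references:  list of example instructions (the few-shot references)
    scene_facts: list of (question, answer) pairs about the environment
    """
    lines = ["Example wayfinding instructions:"]
    lines += [f"- {ref}" for ref in references]
    lines.append("Observations along the path:")
    for step, (question, answer) in enumerate(scene_facts, start=1):
        lines.append(f"{step}. Q: {question} A: {answer}")
    lines.append("Write one wayfinding instruction in the style of the examples.")
    return "\n".join(lines)


example = synthesis_prompt(
    ["Walk past the sofa and stop at the door."],
    [("What is ahead?", "A staircase")],
)
```

The resulting string would then be sent to whichever LLM is used for instruction synthesis.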

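We propose an LLM-based approach that generates "human-like" instructions across multiple simulation platforms. It requires no training of any form and uses in-context learning to generate instructions from just a few references.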

Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis

Finding an object of a specific class in an unseen environment remains an
unsolved navigation problem. Hence, we propose a hierarchical learning-based
method for object navigation. The top level performs high-level planning and
builds a memory at the floorplan level (e.g., which room makes the most sense
for the agent to visit next, and where has the agent already been?), while the
lower level is tasked with efficiently navigating between rooms and looking
for objects in them. Instructions can be provided to the agent using a simple
synthetic language. The top level intelligently enhances the instructions in
order to make the overall task more tractable. Language grounding, i.e.,
mapping instructions to visual observations, is performed by an additional,
separate goal assessment module trained with supervision. We demonstrate the
effectiveness of our method in a dynamic, configurable domestic environment.
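
The split between a top level that chooses rooms and remembers visits and a lower level that moves between rooms can be sketched as follows. This is a hand-written toy, not the learned method from the abstract: the floorplan `ROOM_GRAPH`, the prior table `PRIOR`, the `CONTENTS` lookup (used in place of visual perception), and all function names are hypothetical.

```python
from collections import deque

# Hypothetical floorplan: rooms as nodes, doors as edges.
ROOM_GRAPH = {
    "hall": ["kitchen", "bedroom"],
    "kitchen": ["hall"],
    "bedroom": ["hall", "bathroom"],
    "bathroom": ["bedroom"],
}
# Hypothetical prior: how plausible each room is for a given object class.
PRIOR = {"mug": {"kitchen": 0.8, "bedroom": 0.1, "bathroom": 0.05, "hall": 0.05}}
# Hypothetical ground truth, used here in place of visual perception.
CONTENTS = {"hall": [], "kitchen": [], "bedroom": ["mug"], "bathroom": []}


def next_room(target, visited):
    """Top level: pick the most plausible room not yet searched."""
    candidates = [r for r in ROOM_GRAPH if r not in visited]
    if not candidates:
        return None
    return max(candidates, key=lambda r: PRIOR[target].get(r, 0.0))


def route(start, goal):
    """Lower level: shortest room-to-room path by breadth-first search."""
    queue, parents = deque([start]), {start: None}
    while queue:
        room = queue.popleft()
        if room == goal:
            path = []
            while room is not None:
                path.append(room)
                room = parents[room]
            return path[::-1]
        for nxt in ROOM_GRAPH[room]:
            if nxt not in parents:
                parents[nxt] = room
                queue.append(nxt)
    return None


def find_object(target, start):
    """Alternate top-level room choice with lower-level navigation and search."""
    current, visited, search_order = start, set(), []
    while True:
        goal = next_room(target, visited)
        if goal is None:
            return None, search_order  # every room searched, nothing found
        visited.add(goal)              # floorplan-level memory
        if route(current, goal) is None:
            continue                   # unreachable room, try the next one
        current = goal
        search_order.append(goal)
        if target in CONTENTS[goal]:
            return goal, search_order


result = find_object("mug", "hall")
```

Here the kitchen is searched first because it has the highest prior, and the mug is then found in the bedroom. Rooms that are only passed through on the way (the hall in this run) are not marked as searched, which is a simplification of this sketch.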

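This paper proposes a hierarchical learning method consisting of high-level planning and memory and low-level room navigation and object search. Instructions are given to the agent in a simple synthetic language, and a separate goal assessment module maps instructions to visual observations. The method's effectiveness is validated in a dynamic, configurable domestic environment.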