Recent advances in high-fidelity virtual environments serve as one of the
major driving forces for building intelligent embodied agents to perceive,
reason and interact with the physical world. Typically, these environments
remain unchanged unless agents interact with them. However, in real-world
scenarios, agents might also face dynamically changing environments
characterized by unexpected events and need to rapidly take action accordingly.
To remedy this gap, we propose a new simulated embodied benchmark, called
HAZARD, specifically designed to assess the decision-making abilities of
embodied agents in dynamic situations. HAZARD consists of three unexpected
disaster scenarios (fire, flood, and wind) and specifically supports the use
of large language models (LLMs) to assist common-sense reasoning and
decision-making. This benchmark enables us to evaluate autonomous
agents' decision-making capabilities across various pipelines, including
reinforcement learning (RL), rule-based, and search-based methods in
dynamically changing environments. As a first step toward addressing this
challenge using large language models, we further develop an LLM-based agent
and analyze in depth its promise and the challenges it faces in solving these
tasks. HAZARD is available at this https URL
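The idea of an environment that changes without any agent action can be illustrated with a toy example. The sketch below is not HAZARD's simulator: the grid, the four-neighbour spread rule, and the names `spread_fire` and `ticks_until_burning` are assumptions made purely for illustration. It shows how a hazard that advances every tick puts a deadline on the agent's decisions.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]


def spread_fire(burning: Set[Cell], width: int, height: int) -> Set[Cell]:
    """Advance the toy fire by one tick: it spreads to the 4 neighbouring cells."""
    next_burning = set(burning)
    for x, y in burning:
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                next_burning.add((nx, ny))
    return next_burning


def ticks_until_burning(start: Set[Cell], target: Cell, width: int, height: int) -> int:
    """Count the ticks before `target` catches fire, i.e. the agent's time budget."""
    if not start:
        raise ValueError("at least one cell must be burning")
    if not (0 <= target[0] < width and 0 <= target[1] < height):
        raise ValueError("target must lie inside the grid")
    burning, ticks = set(start), 0
    while target not in burning:
        burning = spread_fire(burning, width, height)
        ticks += 1
    return ticks


# The environment evolves on its own; an idle agent simply loses time.
budget = ticks_until_burning({(0, 0)}, (2, 1), width=5, height=5)
```

In this toy setting the budget equals the grid distance from the fire to the object, so an agent that plans slowly or walks a long route may arrive after the object is already lost.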

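Recent advances in high-fidelity virtual environments are one of the driving forces behind building intelligent embodied agents that can perceive, reason about, and interact with the physical world. We propose a new simulated embodied benchmark, called HAZARD, designed to assess the decision-making abilities of embodied agents in dynamic situations. HAZARD comprises three unexpected disaster scenarios (fire, flood, and wind) and specifically supports the use of large language models (LLMs) for common-sense reasoning and decision-making. The benchmark can evaluate the decision-making capabilities of autonomous agents in dynamically changing environments across reinforcement learning (RL), rule-based, and search-based methods. As a first step toward addressing this challenge with large language models, we further develop an LLM-based agent and analyze in depth its strengths and the challenges it faces in solving these difficult tasks. HAZARD is available at this https URL.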

HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

Recent advancements in large language models (LLMs) have exhibited promising
performance in solving sequential decision-making problems. By imitating
few-shot examples provided in the prompts (i.e., in-context learning), an LLM
agent can interact with an external environment and complete given tasks
without additional training. However, such few-shot examples are often
insufficient to generate high-quality solutions for complex and long-horizon
tasks, while the limited context length cannot accommodate larger-scale
demonstrations. To this end, we propose an offline learning framework that
utilizes offline data at scale (e.g., logs of human interactions) to improve
the in-context learning performance of LLM agents. We formally define
LLM-powered policies with both text-based approaches and code-based approaches.
We then introduce an Offline Data-driven Discovery and Distillation (O3D)
framework to improve LLM-powered policies without finetuning. O3D automatically
discovers reusable skills and distills generalizable knowledge across multiple
tasks based on offline interaction data, advancing the capability of solving
downstream tasks. Empirical results under two interactive decision-making
benchmarks (ALFWorld and WebShop) demonstrate that O3D can notably enhance the
decision-making capabilities of LLMs through the offline discovery and
distillation process, and consistently outperform baselines across various LLMs
with both text-based and code-based policies.
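One way to picture a text-based LLM-powered policy that benefits from offline data without finetuning is as a prompt assembled from the discovered skills, the distilled knowledge, and a few examples. The sketch below is an assumption for illustration only, not O3D's implementation: the function `build_policy_prompt`, its parameters, and the prompt layout are all hypothetical, and no LLM is called.

```python
from typing import List


def build_policy_prompt(
    task: str,
    skills: List[str],
    tips: List[str],
    examples: List[str],
) -> str:
    """Assemble a prompt for a text-based policy (hypothetical layout).

    `skills` and `tips` stand in for what an offline stage might extract
    from interaction logs; `examples` are the usual few-shot demonstrations.
    """
    sections = [
        "Available skills:\n" + "\n".join(f"- {s}" for s in skills),
        "Tips distilled from offline data:\n" + "\n".join(f"- {t}" for t in tips),
        "Examples:\n" + "\n\n".join(examples),
        f"Task: {task}\nNext action:",
    ]
    return "\n\n".join(sections)


# Illustrative inputs only; none of these strings come from the paper.
prompt = build_policy_prompt(
    task="put a clean mug on the desk",
    skills=["find(object)", "clean(object)"],
    tips=["Check the sink before searching the cabinets."],
    examples=["Task: put a book on the shelf\nNext action: find(book)"],
)
```

Because the distilled tips are short, they cost far fewer tokens than the raw logs they summarize, which is one plausible reading of how offline data at scale can fit within a limited context length.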

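We propose an offline learning framework that uses large-scale offline data (such as logs of human interactions) to improve the in-context learning performance of large language models. We formally define LLM-powered policies with both text-based and code-based approaches, and introduce an Offline Data-driven Discovery and Distillation (O3D) framework to improve the decision-making capabilities of large language models. Empirical results on two interactive decision-making benchmarks show that O3D notably enhances LLM decision-making through the offline discovery and distillation process, and consistently outperforms baselines with both text-based and code-based policies.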

O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language Models

There is a growing interest in using Large Language Models (LLMs) as agents
to tackle real-world tasks that may require assessing complex situations. Yet,
we have a limited understanding of LLMs' reasoning and decision-making
capabilities, partly stemming from a lack of dedicated evaluation benchmarks.
As negotiating and compromising are key aspects of our everyday communication
and collaboration, we propose using scorable negotiation games as a new
evaluation framework for LLMs. We create a testbed of diverse text-based,
multi-agent, multi-issue, semantically rich negotiation games, with easily
tunable difficulty. To solve the challenge, agents need to have strong
arithmetic, inference, exploration, and planning capabilities, while seamlessly
integrating them. Via systematic zero-shot Chain-of-Thought (CoT) prompting,
we show that agents can negotiate and consistently reach successful deals. We
quantify the performance with multiple metrics and observe a large gap between
GPT-4 and earlier models. Importantly, we test the generalization to new games
and setups. Finally, we show that these games can help evaluate other critical
aspects, such as the interaction dynamics between agents in the presence of
greedy and adversarial players.
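A scorable multi-issue negotiation game can be sketched in a few lines. The rules below are assumptions for illustration rather than the paper's exact setup: each agent holds a private score table over the options of every issue and a minimum acceptable score, and a deal succeeds only if every agent's total meets its threshold. The agent names, issues, and point values are invented.

```python
from typing import Dict

Deal = Dict[str, str]                   # issue -> chosen option
ScoreTable = Dict[str, Dict[str, int]]  # issue -> option -> points


def score(deal: Deal, table: ScoreTable) -> int:
    """Total points one agent assigns to a deal."""
    return sum(table[issue][option] for issue, option in deal.items())


def is_accepted(
    deal: Deal, tables: Dict[str, ScoreTable], thresholds: Dict[str, int]
) -> bool:
    """A deal succeeds if every agent reaches its own minimum score (assumed rule)."""
    return all(score(deal, tables[agent]) >= thresholds[agent] for agent in tables)


# Invented example: two agents, two issues.
tables = {
    "developer": {
        "budget": {"low": 10, "high": 40},
        "site": {"urban": 30, "rural": 5},
    },
    "community": {
        "budget": {"low": 35, "high": 10},
        "site": {"urban": 10, "rural": 40},
    },
}
thresholds = {"developer": 45, "community": 45}

compromise = {"budget": "high", "site": "rural"}  # developer 45, community 50
greedy = {"budget": "high", "site": "urban"}      # developer 70, community 20

compromise_ok = is_accepted(compromise, tables, thresholds)
greedy_ok = is_accepted(greedy, tables, thresholds)
```

Under these assumed rules, raising the thresholds shrinks the set of acceptable deals, which is one simple way difficulty could be tuned. It also shows why arithmetic matters: an agent has to total its own scores correctly before it can judge an offer.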

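Using scorable negotiation games as a new evaluation framework, systematic zero-shot Chain-of-Thought prompting reveals the negotiation capabilities of large language models and the performance gaps between them.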