Complex planning and scheduling problems have long been solved using various
optimization or heuristic approaches. In recent years, imitation learning,
which aims to learn from expert demonstrations, has been proposed as a viable
alternative for solving these problems. Generally speaking, imitation learning
is designed to learn either the reward (or preference) model or the behavioral
policy directly by observing the behavior of an expert. Existing work in
imitation learning and inverse reinforcement learning has focused on imitation
primarily in unconstrained settings (e.g., no limit on fuel consumed by the
vehicle). However, in many real-world domains, the behavior of an expert is
governed not only by reward (or preference) but also by constraints. For
instance, decisions on self-driving delivery vehicles are dependent not only on
the route preferences/rewards (depending on past demand data) but also on the
fuel in the vehicle and the time available. In such problems, imitation
learning is challenging as decisions are not only dictated by the reward model
but are also dependent on a cost-constrained model. In this paper, we provide
multiple methods that match expert distributions in the presence of trajectory
cost constraints: (a) a Lagrangian-based method; (b) meta-gradients that find a
good trade-off between maximizing expected return and minimizing constraint
violation; and (c) a cost-violation-based alternating gradient. We empirically
show that leading imitation learning approaches imitate cost-constrained
behaviors poorly and that our meta-gradient-based approach achieves the best
performance.
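
As a rough illustration of the Lagrangian idea in (a), the sketch below runs a
generic primal-dual update on a toy one-step problem: a softmax policy over a
few routes maximizes expected reward while a multiplier penalizes expected cost
above a budget. This is not the paper's algorithm; the function name
`lagrangian_policy`, the step sizes, and the example rewards and costs are all
assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lagrangian_policy(rewards, costs, budget, steps=5000,
                      lr_theta=0.1, lr_lam=0.05):
    """Toy primal-dual loop (hypothetical helper, not from the paper).

    Maximizes E[reward] subject to E[cost] <= budget over a softmax policy,
    using the Lagrangian L(theta, lam) = E[r] - lam * (E[c] - budget).
    """
    theta = [0.0] * len(rewards)  # policy logits
    lam = 0.0                     # Lagrange multiplier, kept non-negative
    for _ in range(steps):
        p = softmax(theta)
        # Penalized payoff of each action under the current multiplier.
        payoff = [r - lam * c for r, c in zip(rewards, costs)]
        baseline = sum(pi * a for pi, a in zip(p, payoff))
        # Exact gradient of E_p[payoff] with respect to the logits.
        theta = [t + lr_theta * pi * (a - baseline)
                 for t, pi, a in zip(theta, p, payoff)]
        # Dual ascent: raise lam while expected cost exceeds the budget.
        expected_cost = sum(pi * c for pi, c in zip(p, costs))
        lam = max(0.0, lam + lr_lam * (expected_cost - budget))
    return softmax(theta), lam


if __name__ == "__main__":
    # Illustrative numbers only: three routes with made-up rewards and costs.
    policy, lam = lagrangian_policy([1.0, 0.6, 0.2], [1.0, 0.5, 0.1], budget=0.5)
    print(policy, lam)
```

When the budget is loose enough that it never binds, the multiplier stays at
zero and the loop reduces to plain reward maximization. With a tight budget,
simple gradient descent-ascent of this kind may oscillate rather than settle,
which is one reason more careful trade-off schemes are of interest.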

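Using several methods, including a Lagrangian-based method, meta-gradients, and a cost-violation-based alternating gradient, we match expert distributions under trajectory cost constraints, and we show empirically that our meta-gradient method achieves the best performance.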

Imitating Cost-Constrained Behaviors in Reinforcement Learning

In this paper, we take the initiative to investigate the performance of LLMs
on complex planning tasks that require LLMs to understand a virtual spatial
environment simulated via natural language and act correspondingly in text. We
propose a benchmark named Natural Language Planning (NLP) composed of a set of
novel tasks: Brick World, NLVR-based Manipulations, and Natural Language
Navigation. We found that current popular LLMs such as ChatGPT still lack
abilities in complex planning. This raises a question: do LLMs have a good
understanding of environments described in natural language, or are
alternatives such as symbolic representations neater and hence easier for LLMs
to understand? To this end, we propose a novel method called CoS
(Chain-of-Symbol Prompting) that represents the complex environments with
condensed symbolic spatial representations during the chained intermediate
thinking steps. CoS is easy to use and does not need additional training on
LLMs. Extensive experiments indicate that CoS clearly outperforms
Chain-of-Thought (CoT) prompting on all three planning tasks while using fewer
input tokens than CoT on ChatGPT and InstructGPT. The gain is large: up to
60.8 percentage points of accuracy (from 31.8% to 92.6%) on Brick World for
ChatGPT. CoS also substantially reduces the number of prompt tokens, by up to
65.8% (from 407 to 139) for the intermediate steps of the demonstrations on
Brick World.
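
To make the contrast between natural-language and symbolic intermediate steps
concrete, here is a minimal sketch that describes the same brick stack both
ways and re-derives the two reported figures from the numbers in the abstract.
The `/` notation, the sentence templates, and all function names are
assumptions for illustration and may differ from the paper's actual prompt
format.

```python
def describe_in_words(stack):
    """Natural-language description of a stack given bottom to top."""
    sentences = [f"Brick {stack[0]} is on the ground."]
    for below, above in zip(stack, stack[1:]):
        sentences.append(f"Brick {above} is on top of brick {below}.")
    return " ".join(sentences)


def describe_in_symbols(stack):
    """Condensed form, top brick first; '/' stands for 'is on top of'.

    The '/' symbol is a hypothetical choice, not taken from the paper.
    """
    return "/".join(reversed(stack))


def relative_reduction(before, after):
    """Fraction by which a count shrinks when going from before to after."""
    return (before - after) / before


if __name__ == "__main__":
    stack = ["A", "B", "C"]  # A at the bottom, C on top
    print(describe_in_words(stack))
    print(describe_in_symbols(stack))
    # Figures quoted in the abstract: 407 -> 139 tokens, 31.8% -> 92.6% accuracy.
    print(round(100 * relative_reduction(407, 139), 1))
    print(round(92.6 - 31.8, 1))
```

The token figure is a relative reduction, (407 - 139) / 407, whereas the
accuracy figure is an absolute difference between two percentages, which is why
the two are best reported in different units.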

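This paper proposes a benchmark named Natural Language Planning (NLP), composed of the novel tasks Brick World, NLVR-based Manipulations, and Natural Language Navigation, to study how LLMs perform on complex planning tasks that require understanding a virtual spatial environment described in natural language and acting accordingly in text. It finds that popular LLMs such as ChatGPT lack complex planning abilities, and therefore proposes CoS, a new method for LLMs that represents environments with condensed symbolic spatial representations and significantly improves ChatGPT's performance on all three planning tasks.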