There have been widespread claims in the literature about the emergent
reasoning capabilities of pretrained large language models (LLMs). However,
recent studies have found that their ability to plan remains questionable. Through
our experiments using GPT-2, we empirically demonstrate that the performance of
a finetuned baseline remains poor because it violates pre-conditions of actions
in the plans that it generates. To improve the planning capabilities of a
finetuned LLM, we train a verifier that classifies actions as valid or invalid
in a given state. We generate examples of invalid actions by randomly sampling
actions from the same dataset and use them to train the verifier to check
action applicability. In the presence of diverse
sampling from a generator and a verifier which can prune invalid trajectories,
we show significant gains in the success rate on the Blocksworld domain.
Additionally, we show that finetuning the GPT-2 generator itself to create the
verifier generalizes better than finetuning the base GPT-2. Lastly, we
investigate the role of the sampling temperature, which can be used to control
the exploration-exploitation tradeoff.
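
The negative-example construction described above can be illustrated with a minimal sketch. The function name `make_verifier_examples`, the string encoding of states and actions, and the one-negative-per-positive ratio are all assumptions made for illustration; the abstract does not specify them. Note also that a randomly drawn action may happen to be applicable in the state, so this simple scheme can produce some label noise; how the paper handles that case is not stated here.

```python
import random

def make_verifier_examples(pairs, rng):
    """Build (state, action, label) triples for a verifier.

    label 1: the action recorded for this state in the dataset.
    label 0: an action drawn at random from the rest of the same dataset
             (assumed invalid; this may occasionally be wrong).
    """
    all_actions = [action for _, action in pairs]
    examples = []
    for state, action in pairs:
        examples.append((state, action, 1))
        # Exclude the recorded action so the negative differs from the positive.
        candidates = [a for a in all_actions if a != action]
        if candidates:
            examples.append((state, rng.choice(candidates), 0))
    return examples

# Hypothetical toy dataset of (state, action) pairs taken from valid plans.
pairs = [
    ("A on B, B on table, hand empty", "unstack A B"),
    ("holding A, B on table", "putdown A"),
    ("A on table, B on table, hand empty", "pickup B"),
]
for state, action, label in make_verifier_examples(pairs, random.Random(0)):
    print(label, "|", state, "|", action)
```

The resulting triples could then be fed to any binary classifier; the abstract reports that initializing the verifier from the finetuned generator generalizes better than starting from base GPT-2.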

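Through experiments with GPT-2, this paper shows that a finetuned pretrained language model performs poorly at planning. The authors train a verifier that classifies whether an action is applicable in a given state, using invalid actions obtained by random sampling as training examples. Combining the generator with the verifier yields clear gains in success rate.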
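
As a rough illustration of how a generator, a verifier, and the sampling temperature interact, the sketch below samples one action at a time and discards samples the verifier rejects. Everything here is an assumption for illustration: `precondition_verifier` is a hand-coded stand-in for the learned verifier, the logits are made up rather than produced by GPT-2, and the paper prunes whole trajectories rather than single steps.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def precondition_verifier(state, action):
    """Hypothetical stand-in for a learned verifier: Blocksworld preconditions."""
    on, holding = state["on"], state["holding"]

    def is_clear(block):
        return block not in on.values()

    if action[0] == "pickup":
        block = action[1]
        return holding is None and on.get(block) == "table" and is_clear(block)
    if action[0] == "unstack":
        block, below = action[1], action[2]
        return holding is None and on.get(block) == below and is_clear(block)
    return False

def sample_verified_action(state, actions, logits, verifier, temperature, rng,
                           max_tries=20):
    """Sample from the generator's distribution until the verifier accepts."""
    probs = softmax_with_temperature(logits, temperature)
    for _ in range(max_tries):
        action = rng.choices(actions, weights=probs, k=1)[0]
        if verifier(state, action):
            return action
    return None  # every sample was rejected

# A is stacked on B; B and C sit on the table; the hand is empty.
state = {"on": {"A": "B", "B": "table", "C": "table"}, "holding": None}
actions = [("pickup", "B"), ("pickup", "C"), ("unstack", "A", "B")]
logits = [2.0, 0.5, 0.5]  # made-up scores that favor the invalid action
print(sample_verified_action(state, actions, logits, precondition_verifier,
                             temperature=1.0, rng=random.Random(0)))
```

A higher temperature spreads probability mass over more actions, which gives the verifier more distinct candidates to choose from at the cost of sampling lower-scored ones; a lower temperature concentrates on the generator's top choice.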

Learning and Leveraging Verifiers to Improve Planning Capabilities of Pre-trained Language Models

In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved
state-of-the-art performance in many challenging strategy games. Because these
games have complicated rules, an action sampled from the full discrete action
distribution predicted by the learned policy is likely to be invalid according
to the game rules (e.g., walking into a wall). The usual approach to dealing
with this problem in policy gradient algorithms is to "mask out" invalid actions
and sample only from the set of valid actions. The implications of this process,
however, remain under-investigated. In this paper, we 1) provide a theoretical
justification for this practice, 2) empirically demonstrate its importance as
the space of invalid actions grows, and 3) provide further insights by
evaluating different action masking regimes, such as removing masking after an
agent has been trained using masking. The source code can be found at
this https URL

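This paper studies how to handle invalid actions produced by a learned policy when deep reinforcement learning is applied to games with complicated rules. It provides a theoretical justification for invalid action masking, demonstrates its importance empirically, and evaluates different action masking regimes.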
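
The idea of "masking out" invalid actions can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the function names are hypothetical, and the mask is applied by setting the logits of invalid actions to negative infinity before the softmax, which is one common way to realize it.

```python
import math
import random

def masked_softmax(logits, valid_mask):
    """Softmax restricted to valid actions; invalid actions get probability 0."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    masked = [x if ok else float("-inf") for x, ok in zip(logits, valid_mask)]
    top = max(masked)  # finite, because at least one action is valid
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_masked_action(logits, valid_mask, rng):
    """Sample an action index from the renormalized distribution over valid actions."""
    probs = masked_softmax(logits, valid_mask)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Three actions; the middle one is invalid in the current state
# (for example, walking into a wall).
logits = [1.0, 2.0, 3.0]
valid_mask = [True, False, True]
print(masked_softmax(logits, valid_mask))
print(sample_masked_action(logits, valid_mask, random.Random(0)))
```

Because the masked action has probability exactly zero, it can never be sampled, and the remaining probabilities are renormalized over the valid actions only.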