Large language models (LLMs) have demonstrated remarkable language proficiency, but they struggle to solve interactive tasks on their own. Existing methods either rely on gradient access, which is often unavailable for state-of-the-art LLMs like GPT-4, or require diverse, high-quality in-context demonstrations. In this study, we propose LLM-PO, a novel approach that enables LLMs to address these tasks without gradient access or extensive demonstrations. The key idea is to maintain a text-based plan and to ask the LLM to reflect on the pros and cons of the current plan based on experience collected with it, to update the plan, and to collect more experience with the new plan. Experiments on HotpotQA demonstrate that LLM-PO achieves success rates higher than or on par with in-context learning (ICL) baselines while incurring lower inference cost.
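
The reflect-and-update loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `optimize_plan`, `llm`, and `rollout`, the prompt wording, and the fixed iteration count are all hypothetical, and the LLM and the environment are passed in as plain callables so that no particular API is assumed.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical and chosen for illustration only.
LLM = Callable[[str], str]               # prompt text in, generated text out
Episode = Tuple[str, bool]               # (trajectory text, success flag)
Rollout = Callable[[str, str], Episode]  # (plan, task) -> episode


def optimize_plan(llm: LLM, rollout: Rollout, tasks: List[str],
                  initial_plan: str, iterations: int = 3) -> str:
    """Refine a text-based plan from experience, without gradient access."""
    plan = initial_plan
    for _ in range(iterations):
        # 1. Collect experience by attempting each task with the current plan.
        episodes = [rollout(plan, task) for task in tasks]
        log = "\n".join(
            f"[{'success' if ok else 'failure'}] {trajectory}"
            for trajectory, ok in episodes
        )
        # 2. Ask the LLM to reflect on the pros and cons of the current plan.
        reflection = llm(
            f"Plan:\n{plan}\n\nExperience:\n{log}\n\n"
            "List the pros and cons of this plan."
        )
        # 3. Ask the LLM to rewrite the plan in light of that reflection.
        plan = llm(
            f"Plan:\n{plan}\n\nReflection:\n{reflection}\n\n"
            "Write an improved plan."
        )
    return plan
```

Only the plan text changes between iterations in this sketch; the model itself is never updated, which is why black-box access to the LLM is sufficient.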

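This study proposes LLM-PO, a new method that enables LLMs to solve interactive tasks without gradient access or extensive demonstrations. The method maintains a text-based plan, asks the LLM to reflect on the pros and cons of the current plan based on the experience collected with it, and then updates the plan according to the LLM's feedback and collects more experience. Experiments on HotpotQA show that LLM-PO achieves success rates higher than or on par with in-context learning (ICL) baselines while requiring lower inference cost.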

Prompt Optimization of Large Language Models for Interactive Tasks without Gradients and Demonstrations