As language models (LMs) demonstrate their capabilities in various fields,
their application to tasks requiring multi-round interactions has become
increasingly popular. These tasks usually have complex dynamics, so supervised
fine-tuning (SFT) on a limited offline dataset does not yield good performance.
However, only a few works have attempted to directly train LMs within
interactive decision-making environments. We aim to create an effective
mechanism to fine-tune LMs with online reinforcement learning (RL) in these
environments. We propose Reflect-RL, a two-player system to fine-tune an LM
using online RL, where a frozen reflection model assists the policy model. To
generate data for the warm-up SFT stage, we use negative example generation to
enhance the error-correction ability of the reflection model. Furthermore, we
design single-prompt action enumeration and apply curriculum learning to
allow the policy model to learn more efficiently. Empirically, we verify that
Reflect-RL outperforms SFT and online RL without reflection. Test results
indicate that GPT-2-xl fine-tuned with Reflect-RL also outperforms untuned
pre-trained LMs, such as Mistral 7B.

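Reflect-RL uses online reinforcement learning, assisted by a reflection model, to fine-tune a pre-trained language model for multi-round interactive decision-making, and improves performance through single-prompt action enumeration and curriculum learning. Experiments confirm the effectiveness of Reflect-RL in online learning and show that it outperforms standard SFT and online RL without reflection.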

Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

Prompt Engineering (PE) has emerged as a critical technique for guiding Large
Language Models (LLMs) in solving intricate tasks. Its importance is
highlighted by its potential to significantly enhance the efficiency and
effectiveness of human-machine interaction. As tasks grow increasingly complex,
recent advanced PE methods have extended beyond the limitations of single-round
interactions to embrace multi-round interactions, which allows for a deeper and
more nuanced engagement with LLMs. In this paper, we propose an optimal control
framework tailored for multi-round interactions with LLMs. This framework
provides a unified mathematical structure that not only systematizes the
existing PE methods but also sets the stage for rigorous analytical
improvements. Furthermore, we extend this framework to include PE via ensemble
methods and multi-agent collaboration, thereby enlarging the scope of
applicability. By adopting an optimal control perspective, we offer fresh
insights into existing PE methods and highlight theoretical challenges that
warrant future research. In addition, our work lays a foundation for the
development of more effective and interpretable PE methods.
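
As a rough illustration of what an optimal control view of multi-round prompting can look like, the following is a minimal sketch. The notation ($x_t$, $u_t$, $y_t$, $R$, $T$) is assumed here for illustration and is not taken from the paper: $x_t$ is the interaction history before round $t$, $u_t$ is the prompt chosen at round $t$, $y_t$ is the LLM response, and $R$ is a task-specific score of the final history.

```latex
\begin{aligned}
\max_{u_1, \dots, u_T} \quad & \mathbb{E}\big[ R(x_{T+1}) \big] \\
\text{subject to} \quad & y_t \sim P_{\mathrm{LLM}}(\,\cdot \mid x_t, u_t), \\
& x_{t+1} = (x_t, u_t, y_t), \qquad t = 1, \dots, T.
\end{aligned}
```

Under this reading, a single-round method corresponds to $T = 1$, and a multi-round method corresponds to choosing each $u_t$ as a function of the history $x_t$.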

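To solve complex tasks and improve the efficiency of human-machine interaction, this work uses an optimal control framework to formulate multi-round interaction with large language models, covering multi-round interactions, ensemble methods, and multi-agent collaboration. The framework systematizes existing Prompt Engineering methods, explores theoretical challenges, and lays a foundation for developing more effective and interpretable methods.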

Prompt Engineering Through the Lens of Optimal Control

Interactive video object segmentation (iVOS) aims at efficiently harvesting
high-quality segmentation masks of the target object in a video with user
interactions. Most previous state-of-the-art methods tackle iVOS with two
independent networks for user interaction and temporal propagation,
respectively, leading to inefficiencies during the inference stage. In this
work, we propose a unified framework, named Memory Aggregation Networks
(MA-Net), to address the challenging iVOS task more efficiently. Our MA-Net
integrates the interaction and the propagation operations into a single
network, which significantly improves the efficiency of iVOS in the
multi-round interaction setting. More importantly, we propose a simple yet
effective memory aggregation mechanism to record the informative knowledge from
the previous interaction rounds, greatly improving robustness in discovering
challenging objects of interest. We conduct extensive experiments on
the validation set of the DAVIS Challenge 2018 benchmark. In particular, our
MA-Net achieves a J@60 score of 76.1% without any bells and whistles,
outperforming the state of the art by more than 2.7%.
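
For readers unfamiliar with the metric: the J in J@60 denotes region similarity, i.e. the Jaccard index (intersection over union) between a predicted mask and the ground-truth mask, and J@60 is commonly described as the J value reached within 60 seconds of interaction. The sketch below shows only the per-frame Jaccard computation. The function name `jaccard` and the empty-mask convention are assumptions made for illustration; this is not the benchmark's official evaluation code.

```python
import numpy as np


def jaccard(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape.

    Hypothetical helper for illustration; not the official DAVIS evaluator.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)
```

For example, a prediction covering two pixels, one of which lies inside a one-pixel ground-truth mask, has one pixel of intersection and two pixels of union, giving a Jaccard index of 0.5.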

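This work proposes a unified framework called Memory Aggregation Networks (MA-Net) to address interactive video object segmentation more efficiently. It integrates the interaction and propagation operations into a single network and introduces a simple yet effective memory aggregation mechanism that greatly improves robustness in discovering challenging objects of interest. Extensive experiments are conducted on the validation set of the DAVIS Challenge 2018 benchmark; in particular, MA-Net achieves a J@60 score of 76.1% without any bells and whistles, exceeding the state of the art by more than 2.7%.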