Recent methodologies in LLM self-training mostly rely on an LLM generating
responses and keeping those with correct final answers as training data.
This approach often yields a low-quality fine-tuning training set (e.g.,
incorrect plans or intermediate reasoning). In this paper, we develop a
reinforced self-training approach, called ReST-MCTS*, based on integrating
process reward guidance with tree search MCTS* for collecting higher-quality
reasoning traces as well as per-step values to train policy and reward models.
ReST-MCTS* circumvents the per-step manual annotation typically used to train
process reward models by using tree-search-based reinforcement learning: given
oracle final answers, ReST-MCTS* infers process rewards by estimating the
probability that a given step leads to the correct answer. These
inferred rewards serve dual purposes: they act as value targets for further
refining the process reward model and also facilitate the selection of
high-quality traces for policy model self-training. We first show that the
tree-search policy in ReST-MCTS* achieves higher accuracy than prior
LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same
search budget. We then show that by using traces found by this tree-search
policy as training data, we can continuously enhance the three language models
for multiple iterations, and outperform other self-training algorithms such as
ReST$^\text{EM}$ and Self-Rewarding LM.
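
The idea of inferring per-step rewards from final-answer correctness can be illustrated with a minimal sketch. It assumes a plain Monte Carlo estimate (the fraction of sampled traces through a step that end in a correct answer), which is a simplification and not the value definition used by ReST-MCTS*. The names `estimate_step_values` and `select_traces`, and the 0.5 threshold, are hypothetical.

```python
from collections import defaultdict


def estimate_step_values(traces):
    """Estimate a value for every step prefix seen in the sampled traces.

    traces: list of (steps, is_correct), where steps is a list of strings
    and is_correct says whether the final answer matched the oracle.
    Returns {prefix_tuple: fraction of traces through that prefix that
    ended correctly}. This is an assumed Monte Carlo stand-in.
    """
    visits = defaultdict(int)
    wins = defaultdict(int)
    for steps, is_correct in traces:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            visits[prefix] += 1
            wins[prefix] += int(is_correct)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}


def select_traces(traces, values, threshold=0.5):
    """Keep correct traces whose every step has value >= threshold."""
    kept = []
    for steps, is_correct in traces:
        if not is_correct:
            continue
        prefixes = (tuple(steps[:i]) for i in range(1, len(steps) + 1))
        if all(values[p] >= threshold for p in prefixes):
            kept.append(steps)
    return kept
```

In this sketch the estimated values play both roles mentioned in the abstract: they could serve as regression targets for a process reward model, and `select_traces` uses them to pick traces for policy fine-tuning.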

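The reinforcement-learning-based ReST-MCTS* method combines a process reward model with MCTS* tree search to collect high-quality reasoning traces for training policy and reward models, achieving higher accuracy and performance in LLM self-training.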

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

Visual program synthesis is a promising approach to exploit the reasoning
abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual
programs. Training an LLM to write better visual programs is an attractive
prospect, but it is unclear how to accomplish this. No dataset of visual
programs for training exists, and acquisition of a visual program dataset
cannot be easily crowdsourced due to the need for expert annotators. To get
around the lack of direct supervision, we explore improving the program
synthesis abilities of an LLM using feedback from interactive experience. We
propose a method where we exploit existing annotations for a vision-language
task to improvise a coarse reward signal for that task, treat the LLM as a
policy, and apply reinforced self-training to improve the visual program
synthesis ability of the LLM for that task. We describe a series of experiments
on object detection, compositional visual question answering, and image-text
retrieval, and show that in each case, the self-trained LLM outperforms or
performs on par with few-shot frozen LLMs that are an order of magnitude
larger.
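
As a rough illustration of how existing annotations might yield a coarse reward, the sketch below scores a candidate program by whether its output matches the annotated answer. It assumes candidate programs are plain Python callables and that each example is a dict with `input` and `answer` keys; these shapes and the names `coarse_reward` and `collect_training_programs` are hypothetical, not taken from the paper.

```python
def coarse_reward(program, example):
    """Return 1.0 if the program reproduces the annotation, else 0.0.

    Programs that raise an exception earn zero reward, so a crashing
    program is treated the same as a wrong one.
    """
    try:
        output = program(example["input"])
    except Exception:
        return 0.0
    return 1.0 if output == example["answer"] else 0.0


def collect_training_programs(candidates, example):
    """Keep only the candidate programs that earn full reward."""
    return [p for p in candidates if coarse_reward(p, example) == 1.0]
```

The reward is coarse because it only checks the final output: a program can reach the right answer for the wrong reasons and still be kept.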

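Improves the visual program synthesis ability of large language models using feedback from interactive experience: existing annotations for a vision-language task are used to create a coarse reward signal for that task, the language model is treated as a policy, and reinforced self-training is applied. On object detection, compositional visual question answering, and image-text retrieval, the self-trained language model in each case outperforms or matches few-shot frozen language models that are an order of magnitude larger.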

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Reinforcement learning from human feedback (RLHF) can improve the quality of
large language model (LLM) outputs by aligning them with human preferences.
We propose a simple algorithm for aligning LLMs with human preferences inspired
by growing batch reinforcement learning (RL), which we call Reinforced
Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by
generating samples from the policy, which are then used to improve the LLM
policy using offline RL algorithms. ReST is more efficient than typical online
RLHF methods because the training dataset is produced offline, which allows
data reuse. While ReST is a general approach applicable to all generative
learning settings, we focus on its application to machine translation. Our
results show that ReST can substantially improve translation quality, as
measured by automated metrics and human evaluation on machine translation
benchmarks, in a compute- and sample-efficient manner.
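
The loop described above (generate a dataset from the current policy, then improve the policy offline on that dataset) can be sketched as follows. This is a sketch under stated assumptions: reward-filtered fine-tuning over increasing thresholds stands in for the offline RL objective, and `sample_fn`, `reward_fn`, `finetune_fn`, and all default values are hypothetical placeholders supplied by the caller.

```python
def grow(policy, prompts, sample_fn, reward_fn, n_samples):
    """Sample outputs from the current policy and score each one once."""
    dataset = []
    for prompt in prompts:
        for output in sample_fn(policy, prompt, n_samples):
            dataset.append((prompt, output, reward_fn(prompt, output)))
    return dataset


def improve(policy, dataset, finetune_fn, thresholds):
    """Reuse the same offline dataset with increasing reward thresholds."""
    for tau in thresholds:
        kept = [(p, y) for p, y, r in dataset if r >= tau]
        if kept:  # skip a threshold that filters out everything
            policy = finetune_fn(policy, kept)
    return policy


def rest(policy, prompts, sample_fn, reward_fn, finetune_fn,
         n_grow=2, n_samples=4, thresholds=(0.5, 0.8)):
    """Alternate dataset generation and offline improvement."""
    for _ in range(n_grow):
        dataset = grow(policy, prompts, sample_fn, reward_fn, n_samples)
        policy = improve(policy, dataset, finetune_fn, thresholds)
    return policy
```

Because `grow` scores the samples once and `improve` only reads the stored dataset, the same data is reused across several improvement passes, which is the data reuse the abstract credits for the efficiency gain.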

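ReST is a simple algorithm that improves an LLM's policy by generating samples from it and applying offline RL algorithms; it can effectively improve the quality and efficiency of machine translation.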

Reinforced Self-Training (ReST) for Language Modeling

Most domain adaptation methods for machine reading comprehension (MRC) use a
pre-trained question-answer (QA) construction model to generate pseudo QA pairs
for MRC transfer. Such a process inevitably introduces mismatched pairs
(i.e., noisy correspondence) due to i) the unavailability of QA pairs in target
documents, and ii) the domain shift that arises when applying the QA
construction model to the target domain. Noisy correspondence degrades the
performance of MRC, yet existing work has neglected it. To address this
open problem, we propose to construct QA pairs by additionally using
the dialogue related to the documents, as well as a new domain adaptation
method for MRC. Specifically, we propose the Robust Domain Adaptation for
Machine Reading Comprehension (RMRC) method, which consists of an answer
extractor (AE), a question selector (QS), and an MRC model. RMRC filters out the
irrelevant answers by estimating the correlation to the document via the AE,
and extracts the questions by fusing the candidate questions in multiple rounds
of dialogue chats via the QS. With the extracted QA pairs, the MRC model is
fine-tuned and provides feedback to optimize the QS through a novel reinforced
self-training method. Thanks to the optimization of the QS, our method
greatly alleviates the noisy correspondence problem caused by the domain shift.
To the best of our knowledge, this could be the first study to reveal the
influence of noisy correspondence in domain adaptation MRC models and show a
feasible way to achieve robustness to mismatched pairs. Extensive experiments
on three datasets demonstrate the effectiveness of our method.

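This paper proposes a method called RMRC that constructs question-answer pairs using dialogue and domain adaptation techniques and optimizes them with an answer extractor, a question selector, and a reinforced self-training method, thereby addressing problems caused by domain shift in machine reading comprehension, including noisy correspondence and reduced efficiency. Experiments demonstrate the effectiveness of the method.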