Large language models (LLMs) solve problems more accurately and interpretably
when instructed to work out the answer step by step using a
``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a
specific task by supervised fine-tuning, i.e., by using gradient ascent on some
tunable parameters to maximize the average log-likelihood of correct answers
from a labeled training set. Naively combining CoT with supervised tuning
requires supervision not just of the correct answers, but also of detailed
rationales that lead to those answers; these rationales are expensive to
produce by hand. Instead, we propose a fine-tuning strategy that tries to
maximize the \emph{marginal} log-likelihood of generating a correct answer
using CoT prompting, approximately averaging over all possible rationales. The
core challenge is sampling from the posterior over rationales conditioned on
the correct answer; we address it using a simple Markov-chain Monte Carlo
(MCMC) expectation-maximization (EM) algorithm inspired by the self-taught
reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent
contrastive divergence. This algorithm also admits a novel control-variate
technique that drives the variance of our gradient estimates to zero as the
model improves. Applying our technique to GSM8K and the tasks in BIG-Bench
Hard, we find that this MCMC-EM fine-tuning technique typically improves the
model's accuracy on held-out examples more than STaR or prompt-tuning with or
without CoT.
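
To make the objective concrete, it can be written out under notation that is assumed here and does not appear in the abstract: $x$ is a question, $y$ its correct answer, $z$ a latent rationale, and $\theta$ the tunable parameters. The marginal log-likelihood and its gradient are then

\[
\log p_\theta(y \mid x) = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z),
\qquad
\nabla_\theta \log p_\theta(y \mid x)
  = \mathbb{E}_{p_\theta(z \mid x, y)}\!\left[ \nabla_\theta \log p_\theta(z, y \mid x) \right].
\]

The second equality is the standard gradient identity for latent-variable models. Because the expectation is taken under the posterior $p_\theta(z \mid x, y)$, estimating the gradient requires samples of rationales conditioned on the correct answer, which is the sampling problem the MCMC-EM algorithm addresses.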

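Large language models solve problems more accurately and interpretably when a ``chain-of-thought'' prompt makes them work step by step. One way to improve them is supervised fine-tuning, which uses gradient ascent on tunable parameters to maximize the average log-likelihood of the correct answers in a labeled training set. We instead propose a fine-tuning strategy that tries to maximize the \emph{marginal} log-likelihood of generating a correct answer with chain-of-thought prompting, approximately averaging over all possible rationales. To sample from the posterior over rationales conditioned on the correct answer, we use a simple Markov-chain Monte Carlo expectation-maximization (EM) algorithm inspired by the self-taught reasoner, memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence, together with a novel control-variate technique that gradually reduces the variance of the gradient estimates as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique usually improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without chain-of-thought.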

Training Chain-of-Thought via Latent-Variable Inference

Despite recent progress, learning deep generative models and performing
posterior inference in them remain common challenges, especially when the
hidden variables are discrete. This paper is mainly concerned with
algorithms for learning Helmholtz machines, which are characterized by pairing
the generative model with an auxiliary inference model. A common drawback of
previous learning algorithms is that they optimize the targeted marginal
log-likelihood only indirectly, through bounds on it. In contrast, we develop a
new class of algorithms, based on stochastic approximation (SA) theory of the
Robbins-Monro type, to directly optimize the marginal log-likelihood and
simultaneously minimize the inclusive KL-divergence. The resulting learning
algorithm is thus called joint SA (JSA). Moreover, we construct an effective
MCMC operator for JSA. Our results on the MNIST datasets demonstrate that
JSA's performance is consistently superior to that of competing algorithms such
as reweighted wake-sleep (RWS) when learning a range of difficult models.
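
The abstract does not say which MCMC operator is used. As a minimal sketch, the code below assumes a Metropolis independence sampler that proposes latent states from the inference model and targets the generative model's posterior for a fixed observation. The toy numbers in `JOINT` and `PROPOSAL` and all function names are hypothetical, chosen only for illustration.

```python
import math
import random

# Hypothetical toy setup for one fixed observation x and 3 discrete latent states.
JOINT = [0.10, 0.30, 0.05]      # assumed values of the joint p(x, h) for h = 0, 1, 2
PROPOSAL = [0.50, 0.25, 0.25]   # assumed inference model q(h | x), used as the proposal


def exact_posterior(joint):
    """Normalize the joint over h to get the exact posterior p(h | x)."""
    total = sum(joint)
    return [w / total for w in joint]


def mis_step(h, joint, proposal, rng):
    """One Metropolis independence sampler step targeting p(h | x).

    The proposal ignores the current state; acceptance uses the ratio of
    importance weights w(h) = p(x, h) / q(h | x), so p(x) is never needed.
    """
    h_new = rng.choices(range(len(proposal)), weights=proposal)[0]
    w_old = joint[h] / proposal[h]
    w_new = joint[h_new] / proposal[h_new]
    if rng.random() < min(1.0, w_new / w_old):
        return h_new   # accept the proposed latent state
    return h           # reject: the chain stays where it was


def run_chain(n_steps, seed=0):
    """Run a single persistent chain and return empirical state frequencies."""
    rng = random.Random(seed)
    h = 0
    counts = [0] * len(JOINT)
    for _ in range(n_steps):
        h = mis_step(h, JOINT, PROPOSAL, rng)
        counts[h] += 1
    return [c / n_steps for c in counts]


if __name__ == "__main__":
    print("empirical:", run_chain(50_000))
    print("exact:    ", exact_posterior(JOINT))
```

In a full learning loop, each sampled state would feed stochastic-approximation updates of both the generative and the inference parameters with decreasing step sizes; that part is omitted here because the abstract gives no details about it.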

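This paper describes a new algorithm, based on stochastic approximation theory of the Robbins-Monro type, that directly optimizes the marginal log-likelihood while simultaneously minimizing a KL divergence, with the aim of better learning and applying deep generative models with discrete hidden variables. The algorithm is called joint stochastic approximation (JSA), and an effective MCMC operator is constructed for it. Experiments on the MNIST datasets show that JSA markedly outperforms algorithms such as RWS when learning models of varying complexity.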