Large language models (LLMs) solve problems more accurately and interpretably
when instructed to work out the answer step by step using a
``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a
specific task by supervised fine-tuning, i.e., by using gradient ascent on some
tunable parameters to maximize the average log-likelihood of correct answers
from a labeled training set. Naively combining CoT with supervised tuning
requires supervision not just of the correct answers, but also of detailed
rationales that lead to those answers; these rationales are expensive to
produce by hand. Instead, we propose a fine-tuning strategy that tries to
maximize the \emph{marginal} log-likelihood of generating a correct answer
using CoT prompting, approximately averaging over all possible rationales. The
core challenge is sampling from the posterior over rationales conditioned on
the correct answer; we address it using a simple Markov-chain Monte Carlo
(MCMC) expectation-maximization (EM) algorithm inspired by the self-taught
reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent
contrastive divergence. This algorithm also admits a novel control-variate
technique that drives the variance of our gradient estimates to zero as the
model improves. Applying our technique to GSM8K and the tasks in BIG-Bench
Hard, we find that this MCMC-EM fine-tuning technique typically improves the
model's accuracy on held-out examples more than STaR or prompt-tuning with or
without CoT.

大型语言模型通过使用 ``思维链 '' 提示以逐步解决问题的方式更准确地解释，一种监督微调的方法是通过使用可调参数的梯度上升来最大化标记训练集中正确答案的平均对数似然。然而，我们提出了一种微调策略，尝试通过使用思维链提示最大化生成正确答案的`` 边际 '' 对数似然，大致平均所有可能的解释。我们使用受自学习推理器、备忘录式唤醒 - 休眠、马尔可夫性分数爬升和持续对比散度启发的简单马尔可夫链蒙特卡罗 - 期望最大化 (EM) 算法来解决条件于正确答案的解释后验分布的采样问题，并采用一种新颖的控制变量技术，随着模型的改进，将逐渐降低梯度估计的方差。将我们的技术应用于 GSM8K 和 BIG-Bench Hard 中的任务，我们发现这种 MCMC-EM 微调技术通常比 STaR 或带有或不带有思维链提示的微调方法在留存样例上提高模型准确性。