Autoregressive large language models (LLMs) compress knowledge from their
training data through next-token conditional distributions. This limits
tractable querying of this knowledge to start-to-end autoregressive sampling.
However, many tasks of interest -- including sequence continuation, infilling,
and other forms of constrained generation -- involve sampling from intractable
posterior distributions. We address this limitation by using amortized Bayesian
inference to sample from these intractable posteriors. Such amortization is
algorithmically achieved by fine-tuning LLMs via diversity-seeking
reinforcement learning algorithms: generative flow networks (GFlowNets). We
empirically demonstrate that this distribution-matching paradigm of LLM
fine-tuning can serve as an effective alternative to maximum-likelihood
training and reward-maximizing policy optimization. As an important
application, we interpret chain-of-thought reasoning as a latent variable
modeling problem and demonstrate that our approach enables data-efficient
adaptation of LLMs to tasks that require multi-step rationalization and tool
use.

通过使用归约化贝叶斯推理方法从难以通过条件概率分布采样的后验分布中提取样本，我们展示了这种分布匹配模型在 LLM 微调中作为最大似然训练和奖励最大化策略优化的有效替代方法，进而实现了对多步骤推理和工具使用任务的数据高效适应。