Aligning foundation models is essential for their safe and trustworthy
deployment. However, traditional fine-tuning methods are computationally
intensive and require updating billions of model parameters. A promising
alternative, alignment via decoding, adjusts the response distribution directly
without model updates to maximize a target reward $r$, thus providing a
lightweight and adaptable framework for alignment. However, principled decoding
methods rely on oracle access to an optimal Q-function ($Q^*$), which is often
unavailable in practice. Hence, prior SoTA methods either approximate this
$Q^*$ using $Q^{\pi_{\texttt{sft}}}$ (derived from the reference $\texttt{SFT}$
model) or rely on short-term rewards, resulting in sub-optimal decoding
performance. In this work, we propose Transfer $Q^*$, which implicitly
estimates the optimal value function for a target reward $r$ through a baseline
model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$
(which can be different from the target reward $r$). Theoretical analyses of
Transfer $Q^*$ provide a rigorous characterization of its optimality, deriving
an upper bound on the sub-optimality gap and identifying a hyperparameter to
control the deviation from the pre-trained reference $\texttt{SFT}$ model based
on user needs. Our approach significantly reduces the sub-optimality gap
observed in prior SoTA methods and demonstrates superior empirical performance
across key metrics such as coherence, diversity, and quality in extensive tests
on several synthetic and real datasets.
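
To make the idea of alignment via decoding concrete, here is a minimal sketch of one common way to reweight a reference model's next-token distribution with value estimates: $p(z) \propto \pi_{\texttt{sft}}(z \mid s)\,\exp(\alpha\, Q(s, z))$. The abstract does not state the exact decoding rule, so this form, the function name `reweight_next_token`, the parameter `alpha`, and the toy numbers are all assumptions for illustration. In Transfer $Q^*$ the value estimates would come from the baseline model rather than being supplied by hand.

```python
import math

def reweight_next_token(ref_probs, q_values, alpha):
    """Return p(z) proportional to ref_probs[z] * exp(alpha * q_values[z]).

    ref_probs: next-token probabilities from the reference (SFT) model,
               assumed strictly positive.
    q_values:  value estimates for each candidate token (hypothetical inputs).
    alpha:     assumed knob; alpha = 0 recovers the reference distribution,
               larger alpha moves further away from it.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + alpha * q for p, q in zip(ref_probs, q_values)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example with three candidate tokens (made-up numbers).
ref_probs = [0.5, 0.3, 0.2]
q_values = [0.1, 0.9, 0.4]
for alpha in (0.0, 1.0, 10.0):
    probs = reweight_next_token(ref_probs, q_values, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

With `alpha = 0` the output equals the reference distribution; as `alpha` grows, probability mass shifts toward the token with the highest value estimate, which is the trade-off the abstract's hyperparameter is described as controlling.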

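Transfer $Q^*$ addresses the lack of access to the optimal Q-function by implicitly estimating the optimal value function for the target reward through a baseline model aligned with a baseline reward. It reduces the sub-optimality gap of prior methods and shows superior empirical performance on several synthetic and real datasets.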

Transfer Q Star: Principled Decoding for LLM Alignment

A fundamental question in the theory of reinforcement learning is: suppose
the optimal $Q$-function lies in the linear span of a given $d$-dimensional
feature mapping, is sample-efficient reinforcement learning (RL) possible? The
recent and remarkable result of Weisz et al. (2020) resolved this question in
the negative, providing an exponential (in $d$) sample size lower bound, which
holds even if the agent has access to a generative model of the environment.
One may hope that this information theoretic barrier for RL can be circumvented
by further supposing an even more favorable assumption: there exists a
\emph{constant suboptimality gap} between the optimal $Q$-value of the best
action and that of the second-best action (for all states). The hope is that
having a large suboptimality gap would permit easier identification of optimal
actions themselves, thus making the problem tractable; indeed, provided the
agent has access to a generative model, sample-efficient RL is in fact possible
with the addition of this more favorable assumption.
This work focuses on this question in the standard online reinforcement
learning setting, where our main result resolves this question in the negative:
our hardness result shows that an exponential sample complexity lower bound
still holds even if a constant suboptimality gap is assumed in addition to
having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this
implies an exponential separation between the online RL setting and the
generative model setting. Complementing our negative hardness result, we give
two positive results showing that provably sample-efficient RL is possible
either under an additional low-variance assumption or under a novel
hypercontractivity assumption (both implicitly place stronger conditions on the
underlying dynamics model).
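
The two assumptions discussed above can be written out as follows. This is a sketch in assumed notation: the symbols $\phi$, $\theta^*$, $V^*$ and $\Delta_{\min}$ do not appear in the abstract, and any stage or horizon index is omitted for brevity.

```latex
% Linear realizability: the optimal Q-function is linear in a known
% d-dimensional feature map \phi, with an unknown parameter \theta^*.
\[
  Q^*(s,a) = \langle \theta^*, \phi(s,a) \rangle
  \quad \text{for all } (s,a), \qquad \phi(s,a) \in \mathbb{R}^d .
\]

% Constant suboptimality gap: in every state, each non-optimal action is
% worse than the best action by at least a fixed constant \Delta_{\min}.
\[
  V^*(s) - Q^*(s,a) \ge \Delta_{\min} > 0
  \quad \text{for all } s \text{ and all } a \text{ with } Q^*(s,a) < V^*(s),
  \qquad V^*(s) = \max_{a'} Q^*(s,a') .
\]
```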

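This work studies online reinforcement learning and asks whether adding a constant suboptimality gap assumption makes sample-efficient learning possible. It shows that exponentially many samples are still required even when the optimal Q-function is linearly realizable, which implies an exponential separation between the online RL setting and the generative model setting.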

An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

We introduce and analyze a form of variance-reduced $Q$-learning. For
$\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space
$\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of the
optimal $Q$-function in the $\ell_\infty$-norm using
$\mathcal{O}\left(\frac{D}{\epsilon^2 (1-\gamma)^3} \log\left(\frac{D}{1-\gamma}\right)\right)$
samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a
logarithmic factor in the discount complexity. In contrast, our past work shows
that ordinary $Q$-learning has worst-case quartic scaling in the discount
complexity.
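
To get a feel for the bound, the sketch below evaluates the expression inside the $\mathcal{O}(\cdot)$ with the hidden constant set to 1, and compares cubic with quartic dependence on the discount complexity, taken here to be $1/(1-\gamma)$. The function name `vr_q_learning_samples` and the example inputs are hypothetical, and the numbers are only indicative of scaling, not actual sample counts.

```python
import math

def vr_q_learning_samples(num_states, num_actions, eps, gamma):
    """Evaluate D / (eps^2 (1-gamma)^3) * log(D / (1-gamma)).

    The constant hidden by the O-notation is assumed to be 1, so the
    result shows how the bound scales rather than an exact sample count.
    """
    D = num_states * num_actions          # D = |X| * |U|
    horizon = 1.0 / (1.0 - gamma)         # discount complexity 1/(1-gamma)
    return D * horizon**3 / eps**2 * math.log(D * horizon)

# Illustrative inputs (made up): 10 states, 2 actions, accuracy 0.1.
for gamma in (0.9, 0.99):
    horizon = 1.0 / (1.0 - gamma)
    bound = vr_q_learning_samples(10, 2, 0.1, gamma)
    # Quartic versus cubic scaling differs by one factor of the horizon.
    print(f"gamma={gamma}: bound ~ {bound:.3g}, "
          f"cubic factor = {horizon**3:.3g}, quartic factor = {horizon**4:.3g}")
```

The gap between the cubic and quartic factors is a single factor of $1/(1-\gamma)$, which grows quickly as $\gamma$ approaches 1.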

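This work introduces and analyzes a variance-reduced form of Q-learning. For discounted MDPs with finite state and action spaces, it yields an $\epsilon$-accurate estimate of the optimal Q-function with a sample complexity that matches the minimax lower bound up to a logarithmic factor.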