In this work, we study the issue of reward hacking on response length, a
challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on
LLMs. A well-formatted, verbose but less helpful response can often deceive
LLM or even human evaluators into giving it a high score. The same
issue also holds for some reward models in RL. To address the challenges in
both training and evaluation, we establish a more reliable evaluation protocol
for comparing different training configurations, which inspects the trade-off
between LLM evaluation score and response length obtained by varying training
hyperparameters. Based on this evaluation, we conduct large-scale studies
whose results offer insights into how effectively the hyperparameters and tricks
used in RL mitigate length bias. We further propose to improve the reward
model by jointly training two linear heads on shared feature representations to
predict the rewards, one trained to correlate with length, and the other
trained to decorrelate with length and therefore focus more on the actual
content. We then discard the length head in RL to prevent reward hacking on
length. Experiments demonstrate that our approach almost eliminates the reward
correlation with length, and improves the obtained policy by a significant
margin.
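
The two-head idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Pearson correlation as the penalty, and the toy features are all assumptions made for illustration, and the usual preference loss used to train a reward model is omitted.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def dot(w, f):
    """Linear head: inner product of a weight vector and a feature vector."""
    return sum(wi * fi for wi, fi in zip(w, f))

def two_head_rewards(features, w_length, w_content):
    """Apply two linear heads to the same shared feature vectors."""
    r_len = [dot(w_length, f) for f in features]
    r_con = [dot(w_content, f) for f in features]
    return r_len, r_con

def correlation_penalty(r_len, r_con, lengths):
    """Hypothetical auxiliary loss: low when the length head tracks length
    and the content head is uncorrelated with it."""
    return -pearson(r_len, lengths) + abs(pearson(r_con, lengths))

def rl_reward(feature, w_content):
    """At RL time only the content head is used; the length head is discarded."""
    return dot(w_content, feature)

# Toy shared features (assumed): first coordinate grows with length, second does not.
features = [(1.0, 0.5), (2.0, -0.5), (3.0, -0.5), (4.0, 0.5)]
lengths = [10, 20, 30, 40]
w_length, w_content = (1.0, 0.0), (0.0, 1.0)

r_len, r_con = two_head_rewards(features, w_length, w_content)
penalty = correlation_penalty(r_len, r_con, lengths)
```

In this toy setup the length head is perfectly correlated with length and the content head is uncorrelated with it, so the penalty reaches its minimum; a real reward model would learn both heads and the shared features jointly.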

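By establishing an evaluation protocol and jointly training two linear heads on shared feature representations to predict rewards, one correlated with length and the other decorrelated from length so that it focuses more on the actual content, this work reduces the reward's correlation with length and significantly improves the resulting policy.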

ODIN: Disentangled Reward Mitigates Hacking in RLHF

We perform an effective-theory analysis of forward-backward signal
propagation in wide and deep Transformers, i.e., residual neural networks with
multi-head self-attention blocks and multilayer perceptron blocks. This
analysis suggests particular width scalings of initialization and training
hyperparameters for these models. We then take up such suggestions, training
Vision and Language Transformers in practical setups.

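This paper performs an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, suggests corresponding width scalings of initialization and training hyperparameters, and then trains Vision and Language Transformers in practical setups.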

Effective Theory of Transformers at Initialization

There have been long-standing controversies and inconsistencies over the
experiment setup and criteria for identifying the "winning ticket" in the
literature. To reconcile these, we revisit the definition of the lottery ticket
hypothesis, with more comprehensive and rigorous conditions. Under our new
definition, we show concrete evidence to clarify whether the winning ticket
exists across the major DNN architectures and/or applications. Through
extensive experiments, we perform quantitative analysis on the correlations
between winning tickets and various experimental factors, and empirically study
the patterns of our observations. We find that the key training
hyperparameters, such as learning rate and training epochs, as well as the
architecture characteristics such as capacities and residual connections, are
all highly correlated with whether and when the winning tickets can be
identified. Based on our analysis, we summarize a guideline for parameter
settings with regard to specific architecture characteristics, which we hope
will catalyze research progress on the lottery ticket hypothesis. Our
code is publicly available at:
this https URL

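This paper revisits the definition of the lottery ticket hypothesis and, through extensive experiments, shows that training hyperparameters and architecture characteristics are correlated with whether winning tickets can be identified; it proposes a corresponding guideline for parameter settings to advance research on the lottery ticket hypothesis.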