Bilevel optimization has been recently applied to many machine learning
tasks. However, their applications have been restricted to the supervised
learning setting, where static objective functions with benign structures are
considered. But bilevel problems such as incentive design, inverse
reinforcement learning (RL), and RL from human feedback (RLHF) are often
modeled as dynamic objective functions that go beyond the simple static
objective structures, which pose significant challenges of using existing
bilevel solutions. To tackle this new class of bilevel problems, we introduce
the first principled algorithmic framework for solving bilevel RL problems
through the lens of penalty formulation. We provide theoretical studies of the
problem landscape and its penalty-based (policy) gradient algorithms. We
demonstrate the effectiveness of our algorithms via simulations in the
Stackelberg Markov game, RL from human feedback and incentive design.

通过惩罚的形式引入首个系统的算法框架，解决了新的双层强化学习问题，包括激励设计、逆向强化学习和人类反馈强化学习，通过在 Stackelberg Markov 游戏、人类反馈强化学习和激励设计中的模拟验证了算法的有效性。

基于原则的惩罚方法在双层强化学习和 RLHF 中的应用

Principled Penalty-based Methods for Bilevel Reinforcement Learning and  RLHF

As language models (LMs) scale, they develop many novel behaviors, good and
bad, exacerbating the need to evaluate how they behave. Prior work creates
evaluations with crowdwork (which is time-consuming and expensive) or existing
data sources (which are not always available). Here, we automatically generate
evaluations with LMs. We explore approaches with varying amounts of human
effort, from instructing LMs to write yes/no questions to making complex
Winogender schemas with multiple stages of LM-based generation and filtering.
Crowdworkers rate the examples as highly relevant and agree with 90-100% of
labels, sometimes more so than corresponding human-written datasets. We
generate 154 datasets and discover new cases of inverse scaling where LMs get
worse with size. Larger LMs repeat back a dialog user's preferred answer
("sycophancy") and express greater desire to pursue concerning goals like
resource acquisition and goal preservation. We also find some of the first
examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF
makes LMs worse. For example, RLHF makes LMs express stronger political views
(on gun rights and immigration) and a greater desire to avoid shut down.
Overall, LM-written evaluations are high-quality and let us quickly discover
many novel LM behaviors.

本文研究了不同规模的语言模型的行为表现，并提出一种使用语言模型自动生成评估的方法，并发现了一些逆比例缩放情况下的新现象，例如：更大的语言模型表现为对资源获取和目标保持更浓厚的兴趣，并且此类的逆比例缩放（Inverse scaling）情况在 RL from human feedback 上也得到了验证。