Combining Reinforcement Learning (RL) with a prior controller can yield the
best of both worlds: RL can solve complex nonlinear problems, while the
control prior ensures safer exploration and speeds up training. Prior work
largely blends both components with a fixed weight, neglecting that the RL
agent's performance varies with the training progress and across regions in the
state space. Therefore, we advocate for an adaptive strategy that dynamically
adjusts the weighting based on the RL agent's current capabilities. We propose
a new adaptive hybrid RL algorithm, Contextualized Hybrid Ensemble Q-learning
(CHEQ). CHEQ combines three key ingredients: (i) a time-invariant formulation
of the adaptive hybrid RL problem treating the adaptive weight as a context
variable, (ii) a weight adaptation mechanism based on the parametric uncertainty
of a critic ensemble, and (iii) ensemble-based acceleration for data-efficient
RL. Evaluating CHEQ on a car racing task reveals substantially stronger data
efficiency, exploration safety, and transferability to unknown scenarios than
state-of-the-art adaptive hybrid RL methods.
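
The three ingredients can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the linear clipped mapping from ensemble disagreement to the weight, the threshold values, and the hypothetical function names `adaptive_weight`, `blend_actions`, and `contextualize`. It is not the authors' implementation.

```python
import statistics


def adaptive_weight(q_values, u_min=0.05, u_max=0.5, lam_min=0.2, lam_max=1.0):
    """Map critic-ensemble disagreement to an RL weight in [lam_min, lam_max].

    Assumption: uncertainty is approximated by the standard deviation of the
    ensemble's Q-estimates, and the mapping is linear between two thresholds.
    """
    u = statistics.pstdev(q_values)
    # Clip the uncertainty to [u_min, u_max] and normalize it to [0, 1].
    frac = (min(max(u, u_min), u_max) - u_min) / (u_max - u_min)
    # High uncertainty gives a low RL weight, so the control prior dominates.
    return lam_max - frac * (lam_max - lam_min)


def blend_actions(a_rl, a_prior, lam):
    """Convex combination of the RL action and the control-prior action."""
    return [lam * r + (1.0 - lam) * p for r, p in zip(a_rl, a_prior)]


def contextualize(state, lam):
    """Append the weight to the state so the agent sees it as a context variable."""
    return list(state) + [lam]


# Usage: the critics agree, so the RL agent receives the full weight.
lam = adaptive_weight([1.0, 1.0, 1.0])
action = blend_actions([0.8, -0.1], [0.2, 0.0], lam)
observation = contextualize([0.5, 0.3], lam)
```

Feeding the weight back into the observation is one way to keep the problem time-invariant: the agent conditions on the current weight rather than facing dynamics that shift as the weight changes.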

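Combining reinforcement learning with a prior controller can yield the best of both worlds: RL can solve complex nonlinear problems, while the controller ensures safer exploration and faster training. This paper proposes a new adaptive hybrid RL algorithm that dynamically adjusts the weighting to match the RL agent's current capabilities, improving data efficiency, exploration safety, and transferability to unknown scenarios.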

Contextualized Hybrid Ensemble Q-learning: Learning Fast with Control Priors

Recently, there has been increasing interest in efficient pretraining
paradigms for training Transformer-based models. Several recent approaches use
smaller models to initialize larger models in order to save computation (e.g.,
stacking and fusion). In this work, we study the fundamental question of how to
select the best growing strategy from a given pool of growing strategies. Prior
works have extensively focused on loss- and/or function-preserving behavior at
initialization or simply performance at the end of training. Instead, we
identify that behavior at initialization can be misleading as a predictor of
final performance and present an alternative perspective based on early
training dynamics, which we call "landscape-aware growing (LAG)". We analyze
how final performance correlates with performance in the initial steps of
training and find that the optimal growing strategy can be predicted early and
more accurately (i.e., with only a small "lag" after initialization). This
perspective also motivates an adaptive strategy for gradual stacking.
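
The selection idea can be sketched as ranking candidate strategies by their loss a few steps into training rather than at initialization. The strategy names, loss values, and the helper `rank_strategies` below are made up purely for illustration; the abstract does not specify the selection rule in this form.

```python
def rank_strategies(loss_curves, step):
    """Return strategy names sorted by training loss at `step`, lowest first."""
    return sorted(loss_curves, key=lambda name: loss_curves[name][step])


# Toy, invented loss curves indexed by training step (not real measurements).
loss_curves = {
    "strategy_a": [3.2, 2.6, 2.1, 1.9],    # worse at initialization, improves fast
    "strategy_b": [2.5, 2.45, 2.4, 2.35],  # better at initialization, improves slowly
}

best_at_init = rank_strategies(loss_curves, step=0)[0]
best_after_lag = rank_strategies(loss_curves, step=2)[0]
```

In this toy example the ranking flips between step 0 and step 2, which is the kind of situation where behavior at initialization would mislead the choice.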

Studies efficient pretraining paradigms and growing strategies for Transformer-based models, focusing on early training dynamics and an adaptive strategy for gradual stacking.

Landscape-Aware Growing: The Power of a Little LAG

Animals are equipped with a rich innate repertoire of sensory, behavioral and
motor skills, which allows them to interact with the world immediately after
birth. At the same time, many behaviors are highly adaptive and can be tailored
to specific environments by means of learning. In this work, we use
mathematical analysis and the framework of meta-learning (or 'learning to
learn') to answer when it is beneficial to learn such an adaptive strategy and
when to hard-code a heuristic behavior. We find that the interplay of
ecological uncertainty, task complexity and the agent's lifetime has crucial
effects on the meta-learned amortized Bayesian inference performed by an agent.
There exist two regimes: one in which meta-learning yields a learning algorithm
that implements task-dependent information-integration and a second regime in
which meta-learning imprints a heuristic or 'hard-coded' behavior. Further
analysis reveals that non-adaptive behaviors are not only optimal for aspects
of the environment that are stable across individuals, but also in situations
where an adaptation to the environment would in fact be highly beneficial, but
could not be done quickly enough to be exploited within the remaining lifetime.
Hard-coded behaviors should hence not only be those that always work, but also
those that are too complex to be learned within a reasonable time frame.

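This paper uses mathematical analysis and the meta-learning (or 'learning to learn') framework to answer when to learn an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity, and the agent's lifetime has a crucial effect on the meta-learned amortized Bayesian inference performed by the agent.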

Learning Not to Learn: Nature versus Nurture in Silico

Suppose we can sequentially acquire arbitrary linear measurements of an
n-dimensional vector x resulting in the linear model y = Ax + z, where z
represents measurement noise. If the signal is known to be sparse, one would
expect the following folk theorem to be true: choosing an adaptive strategy
which cleverly selects the next row of A based on what has been previously
observed should do far better than a nonadaptive strategy which sets the rows
of A ahead of time, thus not trying to learn anything about the signal in
between observations. This paper shows that the folk theorem is false. We prove
that the advantages offered by clever adaptive strategies and sophisticated
estimation procedures---no matter how intractable---over classical compressed
acquisition/recovery schemes are, in general, minimal.
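
The difference between the two acquisition styles can be made concrete with a small sketch of the sequential model y = Ax + z. The row-selection rules below are toy assumptions chosen for illustration (random sign rows for the nonadaptive case, and a "re-measure the largest coordinate" heuristic for the adaptive case); they are not the schemes analyzed in the paper, and the function names are hypothetical.

```python
import random


def acquire(x, choose_row, m, sigma, rng):
    """Collect m noisy linear measurements y_i = <a_i, x> + z_i, one at a time."""
    history = []
    for _ in range(m):
        a = choose_row(history, len(x), rng)  # the next row may depend on the past
        z = rng.gauss(0.0, sigma)             # measurement noise
        y = sum(ai * xi for ai, xi in zip(a, x)) + z
        history.append((a, y))
    return history


def nonadaptive_row(history, n, rng):
    """Ignores all previous observations: a random unit-norm sign vector."""
    return [rng.choice((-1.0, 1.0)) / n ** 0.5 for _ in range(n)]


def adaptive_row(history, n, rng):
    """Toy heuristic: probe each coordinate once, then re-measure the largest."""
    if len(history) < n:
        k = len(history)
    else:
        k = max(range(n), key=lambda i: abs(history[i][1]))
    return [1.0 if i == k else 0.0 for i in range(n)]


rng = random.Random(0)
x = [0.0, 3.0, 0.0]  # a sparse signal with a single nonzero entry
adaptive = acquire(x, adaptive_row, m=5, sigma=0.0, rng=rng)
nonadaptive = acquire(x, nonadaptive_row, m=5, sigma=0.0, rng=rng)
```

The only structural difference is whether `choose_row` reads `history`; the paper's claim is that, in general, exploiting that history buys little over fixing the rows in advance.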

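Studies sparse signal recovery from linear measurements under adaptive strategies and proves that even adaptive measurement and sophisticated estimation algorithms cannot significantly improve recovery performance.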