The divergence of the Q-value estimation has been a prominent issue in
offline RL, where the agent has no access to the real dynamics. The conventional view
attributes this instability to querying out-of-distribution actions when
bootstrapping value targets. Although this issue can be alleviated with policy
constraints or conservative Q estimation, a theoretical understanding of the
underlying mechanism causing the divergence has been absent. In this work, we
aim to understand this mechanism thoroughly and arrive at an improved solution. We
first identify a fundamental pattern, self-excitation, as the primary cause of
Q-value estimation divergence in offline RL. Then, we propose a novel
Self-Excite Eigenvalue Measure (SEEM) metric based on the Neural Tangent Kernel
(NTK) to measure the evolving properties of the Q-network during training, which
provides an intriguing explanation of how divergence emerges. For the first time,
our theory can reliably decide, at an early stage, whether training will diverge,
and even predict the order of growth of the estimated Q-value, the
model's norm, and the crashing step when an SGD optimizer is used. The
experiments demonstrate perfect alignment with this theoretical analysis.
Building on our insights, we propose to resolve divergence from a novel
perspective, namely improving the model's architecture for better extrapolating
behavior. Through extensive empirical studies, we identify LayerNorm as a good
solution to effectively avoid divergence without introducing detrimental bias,
leading to superior performance. Experimental results show that it still
works in some of the most challenging settings, e.g., using only 1% of the
transitions in the dataset, where all previous methods fail. Moreover, it can be easily plugged
into modern offline RL methods and achieves SOTA results on many challenging
tasks. We also give unique insights into its effectiveness.
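
As a rough illustration of why normalization can tame extrapolation, the sketch below builds a tiny Q-network in plain Python and compares its output on far-from-data inputs with and without LayerNorm. The abstract does not specify the architecture, so the single hidden layer, the placement of LayerNorm before the activation, and all names (`QNetwork`, `output_bound`, `layer_norm`) are assumptions made for this example, not the authors' implementation.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (at most) unit variance, no learned affine."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Uniform initialization with zero biases (an arbitrary choice for this sketch)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class QNetwork:
    """Hypothetical one-hidden-layer Q-network over a concatenated (state, action) vector."""

    def __init__(self, input_dim, hidden_dim=32, use_layer_norm=True, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_layer(rng, input_dim, hidden_dim)
        self.w2, self.b2 = init_layer(rng, hidden_dim, 1)
        self.use_layer_norm = use_layer_norm

    def __call__(self, state_action):
        h = linear(state_action, self.w1, self.b1)
        if self.use_layer_norm:
            h = layer_norm(h)  # feature norm is now at most sqrt(hidden_dim)
        h = relu(h)  # ReLU can only shrink the norm further
        return linear(h, self.w2, self.b2)[0]

    def output_bound(self):
        """Cauchy-Schwarz bound on |Q| that holds for any input when LayerNorm is on."""
        w_norm = math.sqrt(sum(w * w for w in self.w2[0]))
        return w_norm * math.sqrt(len(self.w2[0])) + abs(self.b2[0])


if __name__ == "__main__":
    x = [0.3, -1.2, 0.5, 2.0]
    with_ln = QNetwork(4, use_layer_norm=True)
    without_ln = QNetwork(4, use_layer_norm=False)
    for scale in (1.0, 1e3, 1e6):
        far = [scale * v for v in x]
        print(scale, with_ln(far), without_ln(far))
```

With LayerNorm the output stays within `output_bound()` no matter how far the input is scaled, whereas the plain network in this sketch grows linearly with the input scale. This only illustrates bounded extrapolation; it does not reproduce the SEEM analysis.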

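In offline reinforcement learning, the divergence of Q-value estimation has long been a prominent problem. Through a thorough understanding of the underlying mechanism and improvements to the model architecture, this work proposes a new way to resolve divergence, covering the self-excitation pattern in offline RL and performance gains from a LayerNorm-based architecture.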

Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

Hawkes Processes are a type of point process for modeling self-excitation,
i.e., when the occurrence of an event makes future events more likely to occur.
The self-triggering function of this type of process may be
inferred with an unconstrained optimization method that maximizes
the corresponding log-likelihood function. Unfortunately, the non-convexity of
this procedure, along with the ill-conditioning of the initialization of the
self-triggering function parameters, may make
this method unstable. Here, we introduce Renormalization Factors, over four types of
parametric kernels, as a solution to this instability. These factors are
derived for each of the self-triggering function parameters, and also for more
than one parameter considered jointly. Experimental results show that the
Maximum Likelihood Estimation method performs better with
Renormalization Factors over sets of sequences of several different lengths.
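
To make the objective concrete, here is a minimal sketch of the log-likelihood that such a maximum likelihood procedure would maximize. The abstract does not name its four parametric kernels, so the exponential kernel used here is an assumption, as are the function name `hawkes_exp_loglik` and the parameter names `mu`, `alpha`, and `beta`. The Renormalization Factors themselves are not implemented.

```python
import math


def hawkes_exp_loglik(events, horizon, mu, alpha, beta):
    """Log-likelihood of a univariate Hawkes process observed on [0, horizon].

    Assumed intensity (exponential kernel, chosen for this sketch):
        lambda(t) = mu + sum over t_i < t of alpha * beta * exp(-beta * (t - t_i))
    `events` must be sorted in increasing order and lie within [0, horizon].
    """
    if mu <= 0 or alpha < 0 or beta <= 0:
        raise ValueError("need mu > 0, alpha >= 0, beta > 0")
    loglik = -mu * horizon  # integral of the baseline intensity
    decay_sum = 0.0  # sum over earlier events t_j of exp(-beta * (t - t_j))
    previous = None
    for t in events:
        if previous is not None:
            # Recursive update avoids the O(n^2) double sum over event pairs.
            decay_sum = math.exp(-beta * (t - previous)) * (1.0 + decay_sum)
        loglik += math.log(mu + alpha * beta * decay_sum)
        # Integral of this event's kernel contribution from t to the horizon.
        loglik -= alpha * (1.0 - math.exp(-beta * (horizon - t)))
        previous = t
    return loglik


if __name__ == "__main__":
    print(hawkes_exp_loglik([0.5, 1.2, 1.3, 3.0], 4.0, mu=0.4, alpha=0.6, beta=1.5))
```

An optimizer would maximize this function over `mu`, `alpha`, and `beta`. The function is not concave in `beta`, which is one way the non-convexity mentioned above shows up in practice.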

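By introducing renormalization factors, we provide a way to address the non-convexity and instability of the optimization method used to maximize the log-likelihood function of Hawkes processes, and we improve the performance of maximum likelihood estimation over sequences of several different lengths.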