Safe reinforcement learning (RL) with hard constraint guarantees is a
promising optimal control direction for multi-energy management systems. It
only requires the environment-specific constraint functions themselves a priori and
not a complete model (i.e. plant, disturbance and noise models, and prediction
models for states not included in the plant model - e.g. demand, weather, and
price forecasts). The project-specific upfront and ongoing engineering efforts
are therefore still reduced, better representations of the underlying system
dynamics can still be learned and modeling bias is kept to a minimum (no
model-based objective function). However, even the constraint functions alone
are not always trivial to accurately provide in advance (e.g. an energy balance
constraint requires the detailed determination of all energy inputs and
outputs), leading to potentially unsafe behavior. In this paper, we present two
novel advancements: (I) combining the OptLayer and SafeFallback methods, named
OptLayerPolicy, to increase the initial utility while keeping a high sample
efficiency; and (II) introducing self-improving hard constraints, to increase the
accuracy of the constraint functions as more data becomes available so that
better policies can be learned. Both advancements keep the constraint
formulation decoupled from the RL formulation, so that new (presumably better)
RL algorithms can act as drop-in replacements. We show that, in a
simulated multi-energy system case study, the initial utility is increased to
92.4% (OptLayerPolicy) compared to 86.1% (OptLayer) and that the utility of the
policy after training is increased to 104.9% (GreyOptLayerPolicy) compared to
103.4% (OptLayer), all relative to a vanilla RL benchmark. While introducing
surrogate functions into the optimization problem requires special attention,
we do conclude that the newly presented GreyOptLayerPolicy method is the most
advantageous.
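
To make the idea of a constraint layer that stays decoupled from the RL agent more concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a toy setting: the action is a vector of power setpoints with box limits, and the only hard constraint is an energy balance (the setpoints must sum to the demand). The RL action is replaced by the closest feasible action, and a fallback action that is assumed to be safe a priori is used when no feasible action exists. The names `project_action`, `safe_action`, and `fallback_action` are hypothetical.

```python
import numpy as np


def project_action(x, lo, hi, demand, tol=1e-9):
    """Euclidean projection of x onto {a : lo <= a <= hi, sum(a) == demand}.

    Returns None when the constraint set is empty.
    """
    if demand < lo.sum() - tol or demand > hi.sum() + tol:
        return None
    # The projection has the form clip(x - lam, lo, hi) for a scalar lam.
    # sum(clip(x - lam, lo, hi)) is non-increasing in lam, so bisect on lam.
    lam_lo = np.min(x - hi)  # here every component is clipped to hi
    lam_hi = np.max(x - lo)  # here every component is clipped to lo
    for _ in range(100):
        lam = 0.5 * (lam_lo + lam_hi)
        if np.clip(x - lam, lo, hi).sum() > demand:
            lam_lo = lam
        else:
            lam_hi = lam
    return np.clip(x - 0.5 * (lam_lo + lam_hi), lo, hi)


def safe_action(rl_action, lo, hi, demand, fallback_action):
    """Return (action, used_projection) for one control step.

    The RL agent is untouched: only its proposed action is corrected.
    """
    projected = project_action(rl_action, lo, hi, demand)
    if projected is None:
        # Assumption: fallback_action is known to be safe in advance.
        return fallback_action, False
    return projected, True


lo = np.array([0.0, 0.0])
hi = np.array([4.0, 10.0])
fallback = np.array([0.0, 0.0])

action, ok = safe_action(np.array([5.0, 5.0]), lo, hi, 6.0, fallback)
action_clipped, ok_clipped = safe_action(np.array([10.0, 0.0]), lo, hi, 6.0, fallback)
action_fb, ok_fb = safe_action(np.array([5.0, 5.0]), lo, hi, 20.0, fallback)
```

Because the correction only sees the proposed action and the constraint data, the RL algorithm behind `rl_action` can be swapped without touching this layer. In a setting with self-improving constraints, quantities such as `demand` would come from a surrogate fitted to data instead of a fixed expression; that part is not shown here.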

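This paper introduces two new safe reinforcement learning methods, OptLayerPolicy and self-improving hard constraints, which keep the constraint functions decoupled from the RL formulation in order to improve initial utility and accuracy. In a simulated multi-energy system case study, they achieve an initial utility of 92.4% (OptLayerPolicy) and a post-training policy utility of 104.9% (GreyOptLayerPolicy).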