Training artificial agents to acquire desired skills through model-free reinforcement learning (RL) depends heavily on domain-specific knowledge and on the ability to reset the system to desirable configurations for better reward signals. The former hinders generalization to new domains; the latter precludes training in real-life conditions because physical resets are not scalable. Recently, intrinsic motivation was proposed as an alternative objective to alleviate the first issue, but there has been no reasonable remedy for the second. In this work, we present an efficient online algorithm for empowerment, a type of intrinsic motivation, and address both limitations. Our method is distinguished by its significantly lower sample and computational complexity, along with improved training stability, compared to the relevant state of the art. We achieve this efficiency by using neural networks to transform the challenging empowerment computation into a convex optimization problem. In simulations, our method trains policies with neither domain-specific knowledge nor manual intervention. To address the issue of resetting in RL, we further show that our approach boosts learning when there is no early termination. Our proposed method opens the door to studying intrinsic motivation for policy training and to scaling up model-free RL training in real-life conditions.

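To address the tedium and high cost of computing the deterministic empirical upper equilibrium value (EEI) through a variational lower bound (VLB), this work uses a trainable Gaussian channel to build a general, unbiased EM algorithm. The resulting method requires no extrinsic reward: it achieves stabilizing control across different control environments by using the mutual information between each actuator and the future states. It also greatly reduces sampling complexity, and its advantages are demonstrated.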
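
To make the Gaussian-channel view concrete, here is a minimal sketch of the textbook water-filling solution for the capacity of a linear Gaussian channel under a total power constraint. It is not the algorithm of this work. It assumes the channel's singular values are already known, whereas the text above describes a trainable channel. The function name `water_filling_capacity` and the parameters `power` and `noise_var` are hypothetical, chosen only for illustration.

```python
import math


def water_filling_capacity(singular_values, power, noise_var=1.0):
    """Capacity (in nats) of parallel Gaussian channels via water-filling.

    Illustrative assumption: the channel from actions to future states is
    linear with known singular values and isotropic Gaussian noise.
    Returns (capacity, per-mode power allocation), modes sorted strongest first.
    """
    # Effective gain of each mode: sigma_i^2 / noise variance.
    gains = sorted(
        (s * s / noise_var for s in singular_values if s > 0), reverse=True
    )
    if not gains or power <= 0:
        return 0.0, []

    inv = [1.0 / g for g in gains]  # "floor height" of each mode
    # Find the largest set of strongest modes that all receive positive power.
    for k in range(len(gains), 0, -1):
        water_level = (power + sum(inv[:k])) / k
        if water_level > inv[k - 1]:
            break

    alloc = [max(0.0, water_level - x) for x in inv]
    capacity = sum(0.5 * math.log(1.0 + g * p) for g, p in zip(gains, alloc))
    return capacity, alloc
```

For example, `water_filling_capacity([1.0, 0.1], power=0.5)` assigns all of the power to the stronger mode, because the weaker mode's floor lies above the water level. The maximization over power allocations is a convex problem, which is why a closed-form procedure of this kind exists.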

Efficient Empowerment Estimation for Unsupervised Stabilization