The human intrinsic desire to pursue knowledge, also known as curiosity, is
considered essential in the process of skill acquisition. With the aid of
artificial curiosity, we could equip current techniques for control, such as
Reinforcement Learning, with more natural exploration capabilities. A promising
approach in this respect has consisted of using Bayesian surprise on model
parameters, i.e. a metric for the difference between prior and posterior
beliefs, to favour exploration. In this contribution, we propose to apply
Bayesian surprise in a latent space representing the agent's current
understanding of the dynamics of the system, drastically reducing the
computational costs. We extensively evaluate our method by measuring how well
the agent explores its environment on continuous tasks and the game scores it
achieves on video games. Our model is computationally cheap and compares
favourably with current state-of-the-art methods on several problems. We also
investigate the effects caused by
stochasticity in the environment, which is often a failure case for
curiosity-driven agents. In this regime, the results suggest that our approach
is resilient to stochastic transitions.
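
As a rough illustration of the idea above, Bayesian surprise can be computed as the KL divergence from a prior belief to a posterior belief over a latent variable and used as an intrinsic reward. The sketch below assumes both beliefs are diagonal Gaussians given as (mean, standard deviation) lists; that parameterisation and the function names `gaussian_kl` and `intrinsic_reward` are illustrative assumptions, not details taken from the abstract.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def intrinsic_reward(prior, posterior):
    """Bayesian surprise as KL(posterior || prior) over the latent variable.

    Assumption: each belief is a (means, std_devs) pair of equal-length lists.
    """
    mu_p, sigma_p = prior
    mu_q, sigma_q = posterior
    return gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)


# A transition that shifts the belief yields a positive reward;
# an uninformative transition (posterior equal to prior) yields zero.
prior = ([0.0, 0.0], [1.0, 1.0])
surprising = ([1.0, 0.0], [1.0, 1.0])
print(intrinsic_reward(prior, surprising))
print(intrinsic_reward(prior, prior))
```

Because the divergence is taken over a low-dimensional latent variable rather than over all model parameters, each evaluation costs only a few arithmetic operations per latent dimension, which is consistent with the cost reduction the abstract describes.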

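To improve the exploration capabilities of reinforcement learning systems with artificial curiosity, this paper proposes a method that uses Bayesian surprise, a measure of the difference between prior and posterior beliefs over model parameters, and applies it in the latent space of the agent's model, greatly reducing computational cost. The study shows that the method performs better than current state-of-the-art techniques in environment exploration on continuous tasks and in video game scores, while being robust to stochastic environments.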

Curiosity-Driven Exploration Driven by Latent Bayesian Surprise

Curiosity-Driven Exploration via Latent Bayesian Surprise

Because reinforcement learning suffers from a lack of scalability, online
value (and Q-) function approximation has received increasing interest over
the last decade. This contribution introduces a novel approximation scheme, namely
the Kalman Temporal Differences (KTD) framework, that exhibits the following
features: sample-efficiency, non-linear approximation, non-stationarity
handling and uncertainty management. A first KTD-based algorithm is provided
for deterministic Markov Decision Processes (MDPs); it produces biased
estimates in the case of stochastic transitions. Then the eXtended KTD
framework (XKTD), which handles stochastic MDPs, is described. Convergence is
analyzed for special cases with both deterministic and stochastic transitions.
Related algorithms are evaluated on classical benchmarks. They compare favorably
to the state of the art while exhibiting the announced features.
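
To make the Kalman view of temporal differences more concrete, the following is a minimal sketch of one filtering step for the linear, deterministic-transition special case only. It assumes a linear value function V(s) = theta . phi(s), a random-walk model for the parameters, and the reward treated as a noisy observation of V(s) - gamma * V(s'). The function name `ktd_linear_update` and the noise settings are hypothetical, and this is a textbook Kalman update rather than the authors' algorithm, which the abstract says also covers non-linear approximation.

```python
def ktd_linear_update(theta, P, phi_s, phi_next, reward, gamma=0.9,
                      process_noise=1e-3, obs_noise=1.0):
    """One Kalman-filter step on value-function parameters (linear sketch).

    theta: list of parameters; P: covariance matrix as a list of lists.
    Observation model (assumed): reward = (phi_s - gamma * phi_next) . theta + noise.
    """
    n = len(theta)
    # Prediction: the random-walk parameter model inflates the covariance.
    P_pred = [[P[i][j] + (process_noise if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # Observation vector of the linear reward model.
    h = [phi_s[i] - gamma * phi_next[i] for i in range(n)]
    Ph = [sum(P_pred[i][j] * h[j] for j in range(n)) for i in range(n)]
    innovation_var = sum(h[i] * Ph[i] for i in range(n)) + obs_noise
    gain = [Ph[i] / innovation_var for i in range(n)]
    # Innovation: the temporal-difference error under the current parameters.
    td_error = reward - sum(h[i] * theta[i] for i in range(n))
    new_theta = [theta[i] + gain[i] * td_error for i in range(n)]
    # Covariance correction; P_pred is symmetric, so h^T P_pred equals Ph^T.
    new_P = [[P_pred[i][j] - gain[i] * Ph[j] for j in range(n)] for i in range(n)]
    return new_theta, new_P


# Single state with a self-loop, reward 1 and gamma 0.5: the true value is 2.
theta, P = [0.0], [[10.0]]
for _ in range(500):
    theta, P = ktd_linear_update(theta, P, [1.0], [1.0], 1.0, gamma=0.5)
print(theta[0])
```

The covariance `P` is what gives the uncertainty information mentioned in the abstract, and the process noise keeps the gain from vanishing, which is one simple way to track a non-stationary target.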

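This paper introduces a new approximation framework, Kalman Temporal Differences (KTD), to address the scalability problem of value function estimation in reinforcement learning. It provides the KTD and XKTD algorithms for deterministic and stochastic Markov decision processes, respectively, and demonstrates their convergence and better performance than existing algorithms.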