This paper aims to demonstrate the efficacy of using Contrastive Random Walk as a curiosity method to achieve faster convergence to the optimal policy. Contrastive Random Walk defines the transition matrix of a random walk with neural networks and learns a meaningful state representation using a closed loop. The loss of Contrastive Random Walk serves as an intrinsic reward and is added to the environment reward. In non-tabular sparse-reward scenarios, our method obtains the highest reward among the compared methods within the same number of iterations. Contrastive Random Walk is also more robust: its performance changes little across different random initializations of the environment. We also find that adaptive restart and an appropriate temperature are crucial to the performance of Contrastive Random Walk.
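
The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that each time step provides one embedding per node, that transition probabilities are a temperature-scaled softmax over cosine similarities, that the closed loop is a forward-then-backward (palindrome) walk, and that the intrinsic reward is weighted by a coefficient `beta`. The names `transition_matrix`, `cycle_loss`, `shaped_reward`, and `beta` are hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def transition_matrix(src, dst, temperature):
    """Row-stochastic matrix: softmax over temperature-scaled cosine similarities."""
    src = [normalise(v) for v in src]
    dst = [normalise(v) for v in dst]
    rows = []
    for s in src:
        logits = [sum(a * b for a, b in zip(s, d)) / temperature for d in dst]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cycle_loss(embeddings, temperature=0.1):
    """Cross-entropy between the round-trip walk and the identity.

    embeddings: list over time steps; each entry is a list of node vectors.
    Assumption: the loop goes forward through time and then back again.
    """
    palindrome = embeddings + embeddings[-2::-1]
    n = len(embeddings[0])
    walk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for src, dst in zip(palindrome[:-1], palindrome[1:]):
        walk = matmul(walk, transition_matrix(src, dst, temperature))
    # Each node should return to itself, so penalise low diagonal mass.
    return -sum(math.log(min(max(walk[i][i], 1e-12), 1.0)) for i in range(n)) / n

def shaped_reward(env_reward, loss, beta=0.1):
    """Add the loss to the environment reward as an intrinsic bonus (beta is assumed)."""
    return env_reward + beta * loss

# Toy usage: three time steps, two nodes, two-dimensional embeddings.
steps = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.3], [0.1, 0.9]],
]
loss = cycle_loss(steps, temperature=0.1)
print(shaped_reward(0.0, loss))
```

In this sketch a lower temperature sharpens each transition distribution, so the same embeddings can yield very different loss values, and hence intrinsic rewards, at different temperatures. This is one plausible reading of why the temperature matters.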

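This paper aims to demonstrate the effectiveness of using Contrastive Random Walk as a curiosity method to achieve faster convergence to the optimal policy. Contrastive Random Walk defines the transition matrix of a random walk through neural networks and learns a meaningful state representation; its loss is then added to the environment reward as an intrinsic reward. The authors demonstrate the robustness of Contrastive Random Walk in non-tabular sparse-reward scenarios and show that the method obtains the highest reward within the same number of iterations. They also find that adaptive restart and an appropriate temperature are crucial to the performance of Contrastive Random Walk.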

Discovering Intrinsic Rewards with Contrastive Random Walk