Recent advances in reinforcement learning have spurred growing interest
in learning user models adaptively through dynamic interactions, e.g., in
reinforcement-learning-based recommender systems. The reward function is crucial
for most reinforcement learning applications because it guides the
optimization. However, current reinforcement-learning-based methods
rely on manually defined reward functions, which cannot adapt to dynamic and
noisy environments. Moreover, they generally use task-specific reward functions
that sacrifice generalization ability. To address these issues, we propose a
generative inverse reinforcement learning approach to user behavioral
preference modelling. Instead of using predefined reward functions, our model
automatically learns rewards from users' actions using a discriminative
actor-critic network and a Wasserstein GAN. Our model provides a general way of
characterizing and explaining underlying behavioral tendencies, and our
experiments show our method outperforms state-of-the-art methods in a variety
of scenarios, namely traffic signal control, online recommender systems, and
scanpath prediction.

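We propose a user behavioral preference modelling method based on generative inverse reinforcement learning. The method automatically learns the reward function underlying users' behavior, models and explains it with a discriminative actor-critic network and a Wasserstein generative adversarial network, and experiments show that it outperforms existing methods in scenarios such as traffic signal control, online recommender systems, and scanpath prediction.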
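
To make the idea of learning rewards from user actions concrete, the following is a minimal sketch of a Wasserstein-style critic whose score is reused as a reward. It is not the authors' implementation: the linear critic, the weight clipping, the two-dimensional state-action features, and all names (`critic_score`, `train_critic`, `mean_features`) are assumptions made purely for illustration, and the actor-critic policy update is omitted.

```python
import random


def critic_score(w, x):
    """Linear critic f_w(x) = w . x over a state-action feature vector x.

    In this sketch the critic score doubles as the learned reward.
    """
    return sum(wi * xi for wi, xi in zip(w, x))


def mean_features(batch):
    """Column-wise mean of a batch of feature vectors."""
    return [sum(col) / len(batch) for col in zip(*batch)]


def train_critic(expert, policy, steps=200, lr=0.05, clip=1.0):
    """Gradient ascent on E_expert[f_w] - E_policy[f_w].

    For a linear critic the gradient is simply the difference of the
    mean feature vectors. Weight clipping is used as a crude Lipschitz
    constraint (an assumption of this sketch).
    """
    w = [0.0] * len(expert[0])
    mu_e, mu_p = mean_features(expert), mean_features(policy)
    for _ in range(steps):
        w = [max(-clip, min(clip, wi + lr * (e - p)))
             for wi, e, p in zip(w, mu_e, mu_p)]
    return w


# Hypothetical toy data: observed user behavior clusters near (1, 0),
# while the current policy's behavior clusters near (0, 1).
rng = random.Random(0)
expert = [[1 + rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
policy = [[rng.gauss(0, 0.1), 1 + rng.gauss(0, 0.1)] for _ in range(50)]

w = train_critic(expert, policy)
print(critic_score(w, [1.0, 0.0]), critic_score(w, [0.0, 1.0]))
```

In this toy setting the trained critic assigns a higher score to user-like behavior than to policy-like behavior, so a policy that maximises the score is pushed toward the observed user behavior without any hand-written reward.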