This paper studies the problem of learning interactive recommender systems from logged feedback, without any exploration in online environments. We address the problem by proposing a general offline reinforcement learning framework for recommendation, which enables maximizing cumulative user rewards without online exploration. Specifically, we first introduce a probabilistic generative model for interactive recommendation, and then propose an effective inference algorithm for discrete and stochastic policy learning based on logged feedback. To perform offline learning more effectively, we propose five approaches to minimize the distribution mismatch between the logging policy and the recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints, and reward extrapolation. We conduct extensive experiments on two public real-world datasets, demonstrating that the proposed methods achieve superior performance over existing supervised learning and reinforcement learning methods for recommendation.
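
To make the idea of limiting distribution mismatch concrete, the following is a minimal sketch of one common way to express a policy constraint: an expected-reward objective penalized by the KL divergence from the logging policy. This is an illustrative assumption and not necessarily the formulation used in the paper; the function names, the `penalty_weight` parameter, and the toy numbers are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same item set."""
    # Terms with p_i == 0 contribute nothing; eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def constrained_objective(policy, logging_policy, rewards, penalty_weight=1.0):
    """Expected reward under `policy` minus a KL penalty toward the logging policy.

    Hypothetical illustration of a policy constraint, not the paper's objective.
    """
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    penalty = penalty_weight * kl_divergence(policy, logging_policy)
    return expected_reward - penalty

# Toy setup (made-up numbers): three items, one state.
logging_policy = [0.7, 0.2, 0.1]   # how often each item was shown in the logs
rewards = [0.1, 0.5, 1.0]          # assumed per-item reward estimates

greedy = [0.0, 0.0, 1.0]           # always recommends the rarely logged item
cautious = [0.5, 0.2, 0.3]         # shifts toward it while staying near the logs

greedy_score = constrained_objective(greedy, logging_policy, rewards)
cautious_score = constrained_objective(cautious, logging_policy, rewards)
print(greedy_score < cautious_score)
```

With the penalty enabled, the policy that stays close to the logged behavior scores higher than the greedy one, even though the greedy policy has the larger unpenalized expected reward. Setting `penalty_weight=0.0` removes the constraint and reverses that ordering.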

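This paper studies the problem of learning interactive recommender systems from logged feedback without exploration in an online environment, and proposes a general offline reinforcement learning framework for recommendation that addresses the problem by maximizing cumulative user rewards. To make offline learning more effective, we propose five approaches to minimize the distribution mismatch between the logging policy and the recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints, and reward extrapolation. We conduct extensive experiments on two public real-world datasets, showing that the proposed methods outperform existing supervised learning and reinforcement learning methods for recommendation.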

A General Offline Reinforcement Learning Framework for Interactive Recommendation