In recent years, there has been great interest in applying reinforcement learning (RL) to recommender systems (RS), along with considerable challenges. In this paper, we summarize three key practical challenges of large-scale RL-based recommender systems: massive state and action spaces, a high-variance environment, and the unspecific reward setting in recommendation. All of these problems remain largely unexplored in the existing literature and make the application of RL challenging. We develop a model-based reinforcement learning framework with a disentangled universal value function, called GoalRec. Combining ideas from world models (model-based RL), value function estimation (model-free RL), and goal-based RL, we propose a novel model-based value function formalization. It generalizes to the various goals that the recommender may have, and disentangles the stochastic environmental dynamics and high-variance reward signals accordingly. As part of the value function, a high-capacity, reward-irrelevant world model is trained, free from the sparse and high-variance reward signals, to simulate complex environmental dynamics under a given goal. Based on the predicted environmental dynamics, the disentangled universal value function is tied to the user's future trajectory rather than to a monolithic state and a scalar reward. We demonstrate the superiority of GoalRec over previous approaches on the above three practical challenges in a series of simulations and a real application.

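This paper describes the challenges of applying reinforcement learning to recommender systems and a method that addresses them, GoalRec. It proposes a novel disentangled universal value function that generalizes to a variety of goals and disentangles the high-variance environmental dynamics and reward signals. In a series of simulations and a real application, GoalRec shows superior practicality and addresses key challenges of large-scale RL-based recommender systems.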

Reinforcement Learning with a Disentangled Universal Value Function for Item Recommendation