Model-free RL-based recommender systems have recently received increasing
research attention due to their capability to handle partial feedback and
long-term rewards. However, most existing research has ignored a critical
feature in recommender systems: one user's feedback on the same item at
different times is random. This stochastic-reward property differs
fundamentally from classic RL scenarios with deterministic rewards and makes
RL-based recommender systems much more challenging. In this paper, we first
demonstrate in a simulator environment that using direct stochastic feedback
results in a significant drop in performance. Then, to handle the stochastic
feedback more efficiently, we design two stochastic reward stabilization
frameworks that replace the direct stochastic feedback with feedback learned by
a supervised model. Both frameworks are model-agnostic, i.e., they can
effectively utilize various supervised models. We demonstrate the superiority
of the proposed frameworks over different RL-based recommendation baselines
with extensive experiments on a recommendation simulator as well as an
industrial-level recommender system.
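
As a rough illustration of the stabilization idea described above, the following minimal sketch replaces a raw stochastic click signal with the prediction of a supervised model before it would be handed to an RL learner. Everything here is an assumption made for illustration: `MeanRewardModel` is a hypothetical stand-in (a per-pair running mean) for whatever supervised model is actually used, `stabilized_reward` is a hypothetical helper, and the click probability of 0.3 is invented. It is not the paper's implementation of either framework.

```python
import random
from collections import defaultdict


class MeanRewardModel:
    """Hypothetical stand-in for a supervised model: per-(user, item) running mean."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, user, item, feedback):
        # Accumulate the observed feedback for this (user, item) pair.
        self.total[(user, item)] += feedback
        self.count[(user, item)] += 1

    def predict(self, user, item):
        # Predicted reward is the mean feedback seen so far (0.0 if unseen).
        n = self.count[(user, item)]
        return self.total[(user, item)] / n if n else 0.0


def stabilized_reward(model, user, item, feedback):
    """Fit the model on the raw feedback, then return its prediction as the reward."""
    model.update(user, item, feedback)
    return model.predict(user, item)


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
click_prob = 0.3  # assumed true click probability for one (user, item) pair
model = MeanRewardModel()
raw, stabilized = [], []
for _ in range(2000):
    feedback = 1.0 if rng.random() < click_prob else 0.0  # stochastic feedback
    raw.append(feedback)
    stabilized.append(stabilized_reward(model, "u0", "i0", feedback))

# Compare the spread of the two reward signals over the last 1000 interactions.
raw_var = variance(raw[-1000:])
stab_var = variance(stabilized[-1000:])
print(f"raw variance: {raw_var:.4f}, stabilized variance: {stab_var:.6f}")
```

In this toy setting the stabilized signal fluctuates far less than the raw 0/1 feedback, which is the property a value-based learner would benefit from when it uses the reward as a regression target.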

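A model-free RL-based recommender system that introduces two stochastic reward stabilization frameworks to replace the direct stochastic feedback, thereby handling the randomness of a user's feedback on the same item at different times.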

Model-free Reinforcement Learning with Stochastic Reward Stabilization for Recommender Systems

We study a multi-agent reinforcement learning (MARL) problem where the agents
interact over a given network. The goal of the agents is to cooperatively
maximize the average of their entropy-regularized long-term rewards. To
overcome the curse of dimensionality and to reduce communication, we propose a
Localized Policy Iteration (LPI) algorithm that provably learns a
near-globally-optimal policy using only local information. In particular, we
show that, despite restricting each agent's attention to only its $\kappa$-hop
neighborhood, the agents are able to learn a policy with an optimality gap that
decays polynomially in $\kappa$. In addition, we show the finite-sample
convergence of LPI to the globally optimal policy, which explicitly captures the
trade-off between optimality and computational complexity in choosing $\kappa$.
Numerical simulations demonstrate the effectiveness of LPI.
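
To make the notion of a $\kappa$-hop neighborhood concrete, the sketch below computes, for one agent, the set of agents reachable within $\kappa$ hops using breadth-first search. The function name `kappa_hop_neighborhood`, the adjacency-list representation, and the six-agent line network are all assumptions chosen for illustration. This only shows which agents' information would be visible under the locality restriction; it is not the LPI algorithm itself.

```python
from collections import deque


def kappa_hop_neighborhood(adjacency, agent, kappa):
    """Return the set of agents within `kappa` hops of `agent` (including itself)."""
    seen = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue  # do not expand beyond the kappa-hop boundary
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


# Hypothetical line network of 6 agents: 0 - 1 - 2 - 3 - 4 - 5
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
neighborhoods = {k: kappa_hop_neighborhood(line, 2, k) for k in range(4)}
for k, members in neighborhoods.items():
    print(k, sorted(members))
```

Growing $\kappa$ enlarges the set of agents each agent must track, which is one way to picture the trade-off between optimality and computational cost mentioned in the abstract.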

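This study proposes an algorithm called Localized Policy Iteration that maximizes the average of the agents' long-term rewards through improved cooperation among agents, addressing the curse of dimensionality and the communication constraints faced in multi-agent reinforcement learning.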