We describe and study a model for an Automated Online Recommendation System
(AORS) in which a user's preferences can be time-dependent and can also depend
on the history of past recommendations and play-outs. Three key features make
the model more realistic than existing models for recommendation systems:
(1) user preference is inherently latent, (2) current recommendations can
affect future preferences, and (3) the model allows for the development of
learning algorithms with provable performance guarantees.
The problem is cast as an average-cost restless multi-armed bandit for a given
user, with an independent partially observable Markov decision process (POMDP)
for each item of content. We analyze the POMDP for a single arm, describe its
structural properties, and characterize its optimal policy. We then develop a
Thompson sampling-based online reinforcement learning algorithm to learn the
parameters of the model and optimize utility from the binary responses of the
users to continual recommendations. We analyze the performance of the
learning algorithm and characterize its regret. Illustrative numerical results
and directions for extension to the restless hidden Markov multi-armed bandit
problem are also presented.

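This paper proposes a model for an automated online recommendation system in
which a user's preferences are time-varying and can depend on the history of
past recommendations and play-outs. Using a Thompson sampling-based online
reinforcement learning algorithm, the model can learn to optimize
recommendation performance with provable performance guarantees.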
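
As a rough illustration of learning from binary user responses with Thompson
sampling, the Python sketch below uses a deliberately simplified setting: each
item has a fixed, unknown like-probability with a Beta posterior. This is an
assumption made for illustration only; it ignores the hidden Markov state
dynamics of the restless bandit model described here and is not the paper's
algorithm. The names `thompson_step`, `simulate`, and `true_probs` are
hypothetical.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample a like-probability per item from its Beta posterior; pick the max."""
    samples = [
        rng.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior (assumption)
        for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


def simulate(true_probs, horizon, seed=0):
    """Recommend one item per step and update counts from the binary response."""
    rng = random.Random(seed)
    n_items = len(true_probs)
    successes = [0] * n_items
    failures = [0] * n_items
    for _ in range(horizon):
        item = thompson_step(successes, failures, rng)
        liked = rng.random() < true_probs[item]  # simulated binary user response
        if liked:
            successes[item] += 1
        else:
            failures[item] += 1
    return successes, failures


if __name__ == "__main__":
    wins, losses = simulate(true_probs=[0.1, 0.9], horizon=2000)
    pulls = [w + l for w, l in zip(wins, losses)]
    print("recommendations per item:", pulls)
```

In this static setting the sampler concentrates its recommendations on the
item with the higher like-probability as evidence accumulates. Handling
preferences that drift with the recommendation history would require
replacing the Beta posterior with a posterior over the hidden-state model
parameters.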