In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seeing recommended content on each page can only return feedback as moving forward in the process or dropping from it until a termination state. We refer to this type of problems as sequential decision making in linear--flow. We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions with analytical solution through exact dynamic programming. The way that we formulate the problem allows us to leverage TS's efficiency in balancing exploration and exploitation and Bandit's convenience in modeling actions' incompatibility. In the simulation study, we observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits. We also find the proposed algorithm's performance is the most robust to changes in the across-page interdependence strength.

本文提出了一种基于MDP与Bandits的Thompson采样算法及其在推荐场景中的应用，旨在解决线性流程中的循环决策问题，该算法在模拟实验中表现出色，特别是在跨页相互依赖变化大的情况下，表现最为鲁棒。

基于赌博机的马尔可夫决策过程在线性流中的序列决策建模