Sequential incentive marketing is an important approach for online businesses
to acquire customers, increase loyalty, and boost sales. How to allocate
incentives effectively so as to maximize the return (e.g., business
objectives) under a budget constraint, however, is less studied in the
literature. This problem is technically challenging because 1) the allocation
strategy has to be learned from historically logged data, which is
counterfactual in nature, and 2) both optimality and feasibility (i.e., that
cost cannot exceed budget) need to be assessed before the strategy is deployed
to online systems. In this paper, we formulate the problem as a constrained Markov
decision process (CMDP). To solve the CMDP problem with logged counterfactual
data, we propose an efficient learning algorithm which combines bisection
search and model-based planning. First, the CMDP is converted into its dual
using Lagrangian relaxation, which is proved to be monotonic with respect to
the dual variable. Furthermore, we show that the dual problem can be solved by
policy learning, with the optimal dual variable being found efficiently via
bisection search (i.e., by taking advantage of the monotonicity). Lastly, we
show that model-based planning can be used to effectively accelerate the joint
optimization process without retraining the policy for every dual variable.
Empirical results on synthetic and real marketing datasets confirm the
effectiveness of our methods.
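
The bisection step described above can be illustrated with a minimal sketch.
Everything in it is an assumption made for illustration, not the paper's
implementation: the toy one-step problem, the tables REWARDS and COSTS, and the
helper names greedy_policy, total_cost and bisect_dual are hypothetical, and an
exact per-user argmax stands in for the policy learning and model-based
planning used in the paper. The sketch relies only on the property stated in
the abstract, namely that the cost of the Lagrangian-optimal policy is
monotonic (here non-increasing) in the dual variable.

```python
from typing import Callable, List

# Hypothetical toy data: 3 users, 3 incentive levels each (level 0 = no incentive).
REWARDS: List[List[float]] = [[0.0, 1.0, 1.5], [0.0, 2.0, 2.2], [0.0, 0.5, 1.2]]
COSTS: List[List[float]] = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]


def greedy_policy(lam: float) -> List[int]:
    """For each user, pick the level maximizing the Lagrangian reward r - lam * c."""
    return [
        max(range(len(r)), key=lambda k: r[k] - lam * c[k])
        for r, c in zip(REWARDS, COSTS)
    ]


def total_cost(lam: float) -> float:
    """Total cost of the policy that is optimal for dual variable lam."""
    return sum(c[k] for c, k in zip(COSTS, greedy_policy(lam)))


def bisect_dual(cost_at: Callable[[float], float], budget: float,
                lam_hi: float = 1.0, tol: float = 1e-6,
                max_iter: int = 100) -> float:
    """Find a dual variable lam >= 0 whose induced policy respects the budget.

    Assumes cost_at(lam) is non-increasing in lam.
    """
    if cost_at(0.0) <= budget:
        return 0.0  # the budget constraint is inactive
    # Grow the upper bracket until the induced policy becomes feasible.
    for _ in range(60):
        if cost_at(lam_hi) <= budget:
            break
        lam_hi *= 2.0
    else:
        raise ValueError("no feasible dual variable found")
    lam_lo = 0.0
    # Invariant: lam_lo is infeasible, lam_hi is feasible.
    for _ in range(max_iter):
        if lam_hi - lam_lo <= tol:
            break
        mid = 0.5 * (lam_lo + lam_hi)
        if cost_at(mid) <= budget:
            lam_hi = mid
        else:
            lam_lo = mid
    return lam_hi


lam = bisect_dual(total_cost, budget=3.0)
print(lam, greedy_policy(lam), total_cost(lam))
```

Because the upper end of the bracket is always feasible, the returned dual
variable yields a policy that stays within the budget. In this discrete toy
problem the cost jumps at a few values of the dual variable, so the budget may
not be spent exactly.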

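This paper proposes a learning algorithm that combines the CMDP framework with model-based planning to address how online businesses can allocate incentives efficiently by learning a strategy from historical order data. Experimental results demonstrate the effectiveness of the method.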

Model-based Constrained MDP for Budget Allocation in Sequential Incentive Marketing

E-commerce platforms usually display a mixed list of ads and organic items in
the feed. One key problem is to allocate the limited slots in the feed to maximize
the overall revenue as well as improve user experience, which requires a good
model for user preference. Instead of modeling the influence of individual
items on user behaviors, the arrangement signal models the influence of the
arrangement of items and may lead to a better allocation strategy. However,
most previous strategies fail to model such a signal and therefore result in
suboptimal performance. In addition, the percentage of ads exposed (PAE) is an
important indicator in ads allocation. An excessive PAE hurts user experience,
while too low a PAE reduces platform revenue. Therefore, keeping the PAE
within a certain range while still providing personalized recommendations is a
challenge. In this paper, we propose Cross Deep Q Network
(Cross DQN) to extract the crucial arrangement signal by crossing the
embeddings of different items and modeling the crossed sequence by
multi-channel attention. We also propose an auxiliary loss that imposes a
batch-level constraint on PAE to tackle the above-mentioned challenge. Our model
achieves higher revenue and better user experience than state-of-the-art
baselines in offline experiments. Moreover, our model demonstrates a significant
improvement in an online A/B test and has been fully deployed on the Meituan
feed to serve more than 300 million customers.
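
To make the idea of a batch-level PAE constraint concrete, here is a minimal
sketch of one possible auxiliary penalty. The abstract does not give the form
of the loss, so everything here is an assumption: the softmax relaxation of the
action choice, the squared-gap penalty, and the names softmax,
batch_pae_penalty, ads_per_action, slots and target_pae are all hypothetical
and are not taken from Cross DQN. The sketch only shows how a constraint can be
placed on the average PAE of a whole batch rather than on each request, which
leaves room for per-request personalization.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def batch_pae_penalty(q_values: Sequence[Sequence[float]],
                      ads_per_action: Sequence[int],
                      slots: int,
                      target_pae: float,
                      temperature: float = 1.0) -> float:
    """Squared gap between the batch's expected PAE and a target PAE.

    q_values[i][a] is the Q-value of candidate action a for request i, and
    ads_per_action[a] is the number of ad slots that action a fills out of
    `slots` positions. Action choice is relaxed to a softmax over Q-values
    (an assumption) so that the expected number of ads varies smoothly.
    """
    expected_ads = 0.0
    for q in q_values:
        probs = softmax([v / temperature for v in q])
        expected_ads += sum(p * a for p, a in zip(probs, ads_per_action))
    batch_pae = expected_ads / (len(q_values) * slots)
    return (batch_pae - target_pae) ** 2


# Two requests, three candidate actions exposing 0, 1 or 2 ads in 4 slots.
q_batch = [[0.2, 0.9, 0.1], [0.5, 0.4, 1.3]]
penalty = batch_pae_penalty(q_batch, ads_per_action=[0, 1, 2], slots=4,
                            target_pae=0.25)
print(penalty)
```

In training, a penalty of this kind would typically be added to the main
Q-learning loss with a weighting coefficient. That coefficient, like the rest
of the sketch, is an assumption rather than a detail from the paper.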

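This paper proposes the Cross Deep Q Network (Cross DQN) model, which extracts the important arrangement signal by crossing the embeddings of different items and models it with multi-channel attention. In addition, we propose an auxiliary loss that handles a batch-level constraint on the percentage of ads exposed, keeping it within a certain range while preserving personalized recommendation. Offline and online experiments show that the model achieves higher revenue and better user experience on the platform.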