We study the budget allocation problem in online marketing campaigns that
utilize previously collected offline data. We first discuss the long-term
effect of optimizing marketing budget allocation decisions in the offline
setting. To address this challenge, we propose a novel game-theoretic offline
value-based reinforcement learning method using mixed policies. Whereas
previous methods need to store infinitely many policies, the proposed method
stores only a constant number, which achieves nearly optimal policy
efficiency and makes it practical for industrial use. We further
show that this method is guaranteed to converge to the optimal policy, which
cannot be achieved by previous value-based reinforcement learning methods for
marketing budget allocation. Our experiments on a large-scale marketing
campaign with tens of millions of users and a budget of more than one billion verify
the theoretical results and show that the proposed method outperforms various
baseline methods. The proposed method has been successfully deployed to serve
all the traffic of this marketing campaign.
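
The abstract does not spell out how the mixed policy is constructed. As a rough illustration only, a mixed policy can be read as a probability distribution over a small, fixed set of base policies, so that only those base policies and their weights need to be stored. The class name `MixedPolicy`, the scalar feature input, and the example weights below are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, Sequence

# Hypothetical signature: maps a scalar user feature to a budget amount.
Policy = Callable[[float], float]


class MixedPolicy:
    """A probability distribution over a fixed, finite set of base policies."""

    def __init__(self, policies: Sequence[Policy], weights: Sequence[float], seed: int = 0):
        if not policies or len(policies) != len(weights):
            raise ValueError("need at least one policy and one weight per policy")
        if any(w < 0 for w in weights) or sum(weights) <= 0:
            raise ValueError("weights must be non-negative with a positive sum")
        self.policies = list(policies)
        self.weights = list(weights)
        self._rng = random.Random(seed)

    def sample(self) -> Policy:
        # Draw one base policy in proportion to its weight.
        return self._rng.choices(self.policies, weights=self.weights, k=1)[0]


# Two illustrative base policies: a low and a high fixed budget.
low_budget = lambda feature: 1.0
high_budget = lambda feature: 5.0
mixed = MixedPolicy([low_budget, high_budget], weights=[0.5, 0.5])
chosen = mixed.sample()  # follow `chosen` for the whole episode
```

Storage here grows with the number of base policies rather than with the number of training iterations, which is the property the abstract emphasizes.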

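This paper proposes a value-based reinforcement learning method for budget allocation in online marketing campaigns that use offline data. By using mixed policies, the method reduces the number of policies that must be stored and achieves nearly optimal policy efficiency. Experiments on a large-scale marketing campaign show that it outperforms the baseline methods.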

Marketing Budget Allocation with Offline Constrained Deep Reinforcement Learning

Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal
control. Recently, the finite-time analysis of Greedy-GQ has been developed
under linear function approximation and Markovian sampling, and the algorithm
is shown to achieve an $\epsilon$-stationary point with a sample complexity on
the order of $\mathcal{O}(\epsilon^{-3})$. Such a high sample complexity is due
to the large variance induced by the Markovian samples. In this paper, we
propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy
optimal control. In particular, the algorithm applies the SVRG-based variance
reduction scheme to reduce the stochastic variance of the two-timescale
updates. We study the finite-time convergence of VR-Greedy-GQ under linear
function approximation and Markovian sampling and show that the algorithm
achieves a much smaller bias and variance error than the original Greedy-GQ.
Specifically, we prove that VR-Greedy-GQ achieves an improved sample complexity
on the order of $\mathcal{O}(\epsilon^{-2})$. We further compare the
performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to
corroborate our theoretical findings.
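
To make the SVRG-based variance reduction scheme concrete, the sketch below applies plain SVRG to a toy scalar least-squares problem. It is not VR-Greedy-GQ: there are no two-timescale updates, no Markovian sampling, and no function approximation. The function name, step size, and data are assumptions chosen for illustration. It only shows the core idea of correcting each stochastic gradient with a full gradient computed at a periodic snapshot.

```python
import random


def svrg_least_squares(xs, ys, step=0.05, epochs=30, seed=0):
    """Fit a scalar w minimizing the mean of 0.5 * (x_i * w - y_i)**2 with SVRG."""
    rng = random.Random(seed)
    n = len(xs)

    def grad_i(w, i):
        # Gradient of the i-th sample loss with respect to w.
        return (xs[i] * w - ys[i]) * xs[i]

    w = 0.0
    for _ in range(epochs):
        # Snapshot the iterate and compute the full-batch gradient there.
        w_ref = w
        full_grad = sum(grad_i(w_ref, i) for i in range(n)) / n
        for _ in range(n):
            i = rng.randrange(n)
            # Variance-reduced estimate: unbiased for the full gradient at w,
            # with variance that shrinks as w and w_ref approach the optimum.
            g = grad_i(w, i) - grad_i(w_ref, i) + full_grad
            w -= step * g
    return w


# Toy data (assumed for illustration), roughly y = 2x with noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)  # closed form
w_hat = svrg_least_squares(xs, ys)
```

Because the correction term cancels the sampling noise near the optimum, the iterate approaches the closed-form solution `w_star` with a constant step size, whereas plain stochastic gradient descent with the same constant step would keep fluctuating around it.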

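This paper presents Greedy-GQ, a value-based reinforcement learning algorithm, and its variance-reduced variant VR-Greedy-GQ. Reducing the variance speeds up convergence and significantly reduces the error. The paper proves convergence with a lower sample complexity and reports experimental results.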