Contextual bandit learning is increasingly favored in modern large-scale
recommendation systems. To better utilize contextual information and
available user or item features, neural networks have been integrated into
contextual bandit learning, which has triggered significant
interest from both academia and industry. However, a major challenge arises
when implementing a disjoint neural contextual bandit solution in large-scale
recommendation systems, where each item or user may correspond to a separate
bandit arm. The huge number of items to recommend poses a significant hurdle
for real-world production deployment. This paper focuses on a joint neural
contextual bandit solution which serves all candidate items with a single
model. The model outputs a predicted reward $\mu$ and an uncertainty $\sigma$,
which are combined through a hyper-parameter $\alpha$ that balances
exploitation and exploration, e.g., $\mu + \alpha \sigma$.
The tuning of the parameter $\alpha$ is typically heuristic and complex in
practice due to its stochastic nature. To address this challenge, we provide
both theoretical analysis and experimental findings regarding the uncertainty
$\sigma$ of the joint neural contextual bandit model. Our analysis reveals that
$\sigma$ demonstrates an approximate square-root relationship with the size of
the last hidden layer $F$ and an inverse square-root relationship with the amount
of training data $N$, i.e., $\sigma \propto \sqrt{\frac{F}{N}}$. The
experiments, conducted with real industrial data, align with the theoretical
analysis, help explain model behavior, and assist hyper-parameter
tuning during both offline training and online deployment.
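
The scoring rule $\mu + \alpha \sigma$ and the scaling $\sigma \propto \sqrt{\frac{F}{N}}$ can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the uncertainty is computed LinUCB-style as $\sqrt{x^\top A^{-1} x}$ on last-hidden-layer features $x$, and synthetic standard-normal features stand in for real network activations. The function names (`last_layer_sigma`, `ucb_score`, `mean_sigma`) and the `ridge` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def last_layer_sigma(train_feats, query_feats, ridge=1.0):
    """Assumed LinUCB-style uncertainty sqrt(x^T A^{-1} x).

    train_feats: (N, F) last-hidden-layer features of the training data.
    query_feats: (M, F) features of the items to score.
    """
    num_features = train_feats.shape[1]
    # Regularized covariance of the training features.
    cov = ridge * np.eye(num_features) + train_feats.T @ train_feats
    cov_inv = np.linalg.inv(cov)
    # Quadratic form x^T A^{-1} x, evaluated per query row.
    quad = np.einsum("ij,jk,ik->i", query_feats, cov_inv, query_feats)
    return np.sqrt(quad)


def ucb_score(mu, sigma, alpha):
    """Exploration-adjusted score mu + alpha * sigma."""
    return mu + alpha * sigma


def mean_sigma(num_features, num_samples, seed=0):
    """Average sigma on synthetic Gaussian features (illustrative only)."""
    rng = np.random.default_rng(seed)
    train = rng.standard_normal((num_samples, num_features))
    query = rng.standard_normal((2000, num_features))
    return float(last_layer_sigma(train, query).mean())


if __name__ == "__main__":
    # Quadrupling F should roughly double sigma; quadrupling N should halve it.
    print(mean_sigma(64, 10_000) / mean_sigma(16, 10_000))
    print(mean_sigma(32, 4_000) / mean_sigma(32, 16_000))
```

Under these assumptions both printed ratios should be close to 2, which suggests that a value of $\alpha$ tuned for one pair $(F, N)$ may need rescaling by roughly $\sqrt{N/F}$ to keep the exploration term $\alpha \sigma$ at a comparable magnitude when the layer width or the training-set size changes.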

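By enhancing contextual bandit learning with neural networks, this paper proposes a joint neural contextual bandit solution for large-scale recommendation systems that serves all recommended items in a single model, and uses theoretical analysis and experimental results to characterize the uncertainty involved in hyper-parameter tuning, aiding both offline training and online deployment.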