Making online decisions can be challenging when features are sparse and
orthogonal to historical ones, especially when the optimal policy is learned
through collaborative filtering. We formulate the problem as a matrix
completion bandit (MCB), where the expected reward under each arm is
characterized by an unknown low-rank matrix. The $\epsilon$-greedy bandit and
the online gradient descent algorithm are explored. Policy learning and regret
performance are studied under a specific schedule for exploration probabilities
and step sizes. A faster-decaying exploration probability yields smaller regret
but learns the optimal policy less accurately. We investigate an online
debiasing method based on inverse propensity weighting (IPW) and a general
framework for online policy inference. The IPW-based estimators are
asymptotically normal under mild arm-optimality conditions. Numerical
simulations corroborate our theoretical findings. Our methods are applied to
the San Francisco parking pricing project data, revealing intriguing
discoveries and outperforming the benchmark policy.
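
To make the setup concrete, the following is a minimal two-arm sketch of an $\epsilon$-greedy matrix completion bandit with online gradient descent on low-rank factors. It is an illustration under stated assumptions, not the paper's algorithm: the uniform sampling of entries, the schedule $\epsilon_t = t^{-\gamma}$, the constant step size, the Gaussian noise, and the names `make_low_rank` and `run_mcb` are all hypothetical choices. The running IPW average is only a simple stand-in for inverse propensity weighting (it estimates the mean entry of the arm-1 matrix), not the debiased estimator studied in the paper.

```python
import numpy as np


def make_low_rank(d1, d2, rank, rng, scale=0.5):
    """Hypothetical helper: random d1 x d2 matrix of the given rank."""
    A = scale * rng.standard_normal((d1, rank))
    B = scale * rng.standard_normal((d2, rank))
    return A @ B.T


def run_mcb(M0, M1, T=5000, rank=2, gamma=0.5, eta=0.05, noise=0.1, seed=0):
    """Two-arm epsilon-greedy bandit; arm a has expected reward M_a[i, j].

    Returns the factor estimates, the cumulative regret, and a running
    IPW average of arm-1 rewards. All defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    truth = (M0, M1)
    d1, d2 = M0.shape
    # Factor estimates: arm a's matrix is approximated by U[a] @ V[a].T
    U = [0.1 * rng.standard_normal((d1, rank)) for _ in range(2)]
    V = [0.1 * rng.standard_normal((d2, rank)) for _ in range(2)]
    regret, ipw_sum = 0.0, 0.0

    for t in range(1, T + 1):
        # Context: one matrix entry, drawn uniformly (assumption).
        i, j = rng.integers(d1), rng.integers(d2)
        eps = min(1.0, t ** (-gamma))  # assumed exploration schedule

        # Greedy arm under the current estimates.
        est = [U[a][i] @ V[a][j] for a in range(2)]
        greedy = int(est[1] > est[0])

        # Play the greedy arm w.p. 1 - eps/2, the other arm w.p. eps/2.
        a = greedy if rng.random() > eps / 2 else 1 - greedy
        prob_arm1 = 1 - eps / 2 if greedy == 1 else eps / 2

        r = truth[a][i, j] + noise * rng.standard_normal()
        regret += max(M0[i, j], M1[i, j]) - truth[a][i, j]

        # IPW term: reweight observed arm-1 rewards by their propensity.
        if a == 1:
            ipw_sum += r / prob_arm1

        # One stochastic gradient step on the squared loss for arm a.
        resid = U[a][i] @ V[a][j] - r
        grad_u = resid * V[a][j]
        grad_v = resid * U[a][i]
        U[a][i] -= eta * grad_u
        V[a][j] -= eta * grad_v

    return U, V, float(regret), float(ipw_sum / T)
```

Raising `gamma` in this sketch makes exploration decay faster, so fewer suboptimal pulls accumulate, but the arm that looks worse is observed more rarely and with smaller propensities, which is the trade-off between regret and policy learning that the abstract describes.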

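Summary: Building on the matrix completion bandit (MCB) and the online gradient descent algorithm, this work studies online decision-making when historical features are sparse. It compares policy learning and regret performance under different schedules for exploration probabilities and step sizes, and studies a debiasing method based on inverse propensity weighting (IPW) together with a general framework for online policy inference. Experiments validate the theoretical results, and an application to the San Francisco parking pricing project data yields notable findings and performance that exceeds the benchmark policy.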