This study proposes a real-time dispatching algorithm based on reinforcement
learning (RL) and, for the first time, deploys it at large scale.
Current dispatching methods on ride-hailing platforms are predominantly based on
myopic or rule-based non-myopic approaches. Reinforcement learning enables
dispatching policies that are informed by historical data and can use
the learned information to optimize the expected returns of future trajectories.
Previous studies in this field have yielded promising results, yet leave room
for further improvement in performance gain, self-dependency,
transferability, and scalable deployment mechanisms. The present study proposes
a standalone RL-based dispatching solution that is equipped with multiple
mechanisms to ensure robust and efficient on-policy learning and inference
while being adaptable for full-scale deployment. A new form of value updating
based on temporal difference is proposed that is better suited to the inherent
uncertainty of the problem. For driver-order assignment, a customized
utility function is proposed that, when tuned to the statistics of the
market, yields remarkable performance improvement and interpretability.
To reduce the risk of cancellation after a driver is assigned, an
adaptive graph pruning strategy based on the multi-armed bandit problem is
introduced. The method is evaluated using offline simulation with real data and
yields a notable performance improvement. The algorithm is also deployed
online in multiple cities under DiDi's operation for A/B testing and is
launched in one of the major international markets as the primary mode of
dispatch. In A/B testing, the deployed algorithm shows an improvement of over
1.3% in total driver income. Furthermore, causal inference analysis detects an
improvement of as much as 5.3% in major performance metrics after full-scale
deployment.

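This study proposes a real-time dispatching algorithm based on reinforcement learning that adopts a new form of temporal-difference value updating and introduces an adaptive graph pruning strategy, achieving an improvement of over 1.3% in total driver income under A/B testing and of up to 5.3% in major performance metrics after full-scale deployment.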
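
For readers unfamiliar with temporal-difference value updating, the following is a minimal sketch of the textbook tabular TD(0) update that such methods typically build on. It is not the new update form proposed in this study, whose details are not given here. The state encoding (zone, time slot), the function name td0_update, the duration-based discounting, and the values of alpha and gamma are all assumptions made for illustration.

```python
from collections import defaultdict


def td0_update(values, state, reward, next_state, duration=1,
               alpha=0.1, gamma=0.9):
    """Apply one textbook TD(0) update in place and return the TD error.

    Assumptions (hypothetical, not taken from the study):
    - `values` maps a state to its estimated long-term value;
    - a trip spanning `duration` time slots is discounted by gamma ** duration.
    """
    # Bootstrapped target: immediate reward plus discounted next-state value.
    target = reward + (gamma ** duration) * values[next_state]
    td_error = target - values[state]
    # Move the current estimate a fraction alpha toward the target.
    values[state] += alpha * td_error
    return td_error


# Hypothetical usage: a driver in zone_a at slot 0 completes a trip worth 10.0
# that ends in zone_b two slots later.
values = defaultdict(float)
err = td0_update(values, ("zone_a", 0), reward=10.0,
                 next_state=("zone_b", 2), duration=2)
```

In a dispatching setting, value estimates of this kind could feed the score of each candidate driver-order pair, so that an assignment is judged by the expected value of the driver's destination state as well as by the immediate fare.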