We present an end-to-end framework for solving the Vehicle Routing Problem (VRP) using deep reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. Our method is faster in both training and inference than a recent method that solves the Traveling Salesman Problem (TSP), with nearly identical solution quality. On the more general VRP, our approach outperforms classical heuristics on medium-sized instances in both solution quality and computation time (after training). Our proposed framework can be applied to variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.

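This paper presents an end-to-end framework that uses reinforcement learning to solve the Vehicle Routing Problem (VRP). We train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. By applying a policy gradient algorithm to optimize its parameters, our model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. On medium-sized instances of the capacitated VRP, our method outperforms classical heuristics and Google's OR-Tools in solution quality, with comparable computation time. The paper also explores the effect of split deliveries on solution quality. Our proposed framework can be applied to other variants of the VRP, such as the stochastic VRP, and has the potential to be applied to combinatorial optimization problems.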
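To make the phrase "a sequence of consecutive actions" under "feasibility rules" concrete, the following is a minimal sketch of feasibility-masked sequential sampling from a stochastic policy on a capacitated VRP instance. It is not the authors' model: the function name `sample_route`, the `temperature` parameter, and the scoring rule (negative distance to each candidate) are illustrative assumptions standing in for a trained neural policy. The sketch also assumes that every demand fits in one vehicle load, that the vehicle returns to the depot only when no customer fits, and that deliveries are not split.

```python
import numpy as np


def sample_route(coords, demands, capacity, rng, temperature=1.0):
    """Sample one feasible capacitated-VRP route, one action at a time.

    coords[0] is the depot and demands[0] must be 0.
    Returns (route, log_prob, length). In a policy gradient method,
    log_prob would be weighted by (length - baseline) to form the update.
    """
    if np.any(demands > capacity):
        raise ValueError("every demand must fit in a single vehicle load")

    n = len(coords)
    remaining = np.asarray(demands, dtype=float).copy()
    load = float(capacity)
    current = 0
    route = [0]
    log_prob = 0.0
    length = 0.0

    while remaining.sum() > 0:
        # Feasibility rule: only unserved customers whose demand fits the load.
        feasible = (remaining > 0) & (remaining <= load)
        if not feasible.any():
            nxt = 0  # forced return to the depot to refill
        else:
            dist = np.linalg.norm(coords - coords[current], axis=1)
            # Placeholder scores (assumption): closer customers are more likely.
            logits = np.where(feasible, -dist / temperature, -np.inf)
            probs = np.exp(logits - logits[feasible].max())
            probs /= probs.sum()
            nxt = int(rng.choice(n, p=probs))
            log_prob += float(np.log(probs[nxt]))

        length += float(np.linalg.norm(coords[nxt] - coords[current]))
        if nxt == 0:
            load = float(capacity)
        else:
            load -= remaining[nxt]
            remaining[nxt] = 0.0
        current = nxt
        route.append(nxt)

    # Close the tour by returning to the depot.
    length += float(np.linalg.norm(coords[0] - coords[current]))
    route.append(0)
    return route, log_prob, length


# Usage on a small random instance (values are arbitrary).
rng = np.random.default_rng(0)
coords = rng.random((8, 2))
demands = np.array([0, 3, 2, 4, 1, 5, 2, 3])
route, log_prob, length = sample_route(coords, demands, capacity=6, rng=rng)
print(route, round(length, 3))
```

Because infeasible actions receive zero probability before sampling, every sampled sequence is a valid solution, and the reward (negative route length) is the only signal needed to compare solutions. A learned policy would replace the distance-based scores with the output of a trained network.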

Solving the Vehicle Routing Problem with Reinforcement Learning