We present MLTCP, a technique that augments today's congestion control algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication phases of jobs that compete for network bandwidth to interleave with each other, thereby utilizing the network efficiently. At the heart of MLTCP lies a simple principle based on a key conceptual insight: DNN training flows should scale their congestion window size based on the number of bytes sent in each training iteration. We show that integrating this principle into today's congestion control protocols is straightforward: with 30-60 lines of code added to Reno, CUBIC, or DCQCN, MLTCP stabilizes the flows of different jobs into an interleaved state within a few training iterations, regardless of the number of competing flows or the start time of each flow. Our experiments with popular DNN training jobs demonstrate that enabling MLTCP reduces the average and 99th percentile training iteration times by up to 2x and 4x, respectively.
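
As a rough illustration of the principle stated in the abstract, the sketch below scales a Reno-style additive increase by the fraction of the current iteration's bytes already sent. The abstract does not give the exact scaling function, so the linear form and the names `MltcpRenoSketch`, `slope`, `intercept`, and `total_bytes_per_iteration` are assumptions made for illustration only, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class MltcpRenoSketch:
    """Toy Reno-style window whose growth is scaled by iteration progress.

    Hypothetical sketch: the linear scaling function and its constants are
    assumptions, not taken from the abstract.
    """

    total_bytes_per_iteration: int  # assumed known (or estimated) per job
    cwnd: float = 10.0              # congestion window, in packets
    bytes_sent: int = 0             # bytes acknowledged in this iteration
    slope: float = 1.0              # hypothetical tuning constant
    intercept: float = 0.5          # hypothetical tuning constant

    def scale(self) -> float:
        """Scaling factor that grows with the bytes sent this iteration."""
        ratio = min(self.bytes_sent / self.total_bytes_per_iteration, 1.0)
        return self.slope * ratio + self.intercept

    def on_ack(self, acked_bytes: int) -> None:
        """Reno congestion avoidance, with the increase multiplied by scale()."""
        self.bytes_sent += acked_bytes
        self.cwnd += self.scale() / self.cwnd

    def on_loss(self) -> None:
        """Standard multiplicative decrease, left unchanged in this sketch."""
        self.cwnd = max(self.cwnd / 2.0, 1.0)

    def on_iteration_end(self) -> None:
        """Reset the byte counter when a new training iteration begins."""
        self.bytes_sent = 0


# A flow that is further into its communication phase grows its window faster.
early = MltcpRenoSketch(total_bytes_per_iteration=1_000_000)
late = MltcpRenoSketch(total_bytes_per_iteration=1_000_000, bytes_sent=800_000)
early.on_ack(1500)
late.on_ack(1500)
print(late.cwnd > early.cwnd)
```

Under these assumptions, the flow closer to finishing its iteration ramps up more aggressively than one that has just started, which is one plausible way competing jobs could drift toward interleaved communication phases.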

MLTCP: Congestion Control for DNN Training