BriefGPT.xyz
Jul, 2022
Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella...
TL;DR
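This paper analyzes the performance of several state-of-the-art RoCE congestion control schemes on distributed training platforms. The results show that improving the performance of distributed training platforms and workloads requires an optimized, low-overhead congestion control scheme designed around the characteristics of those platforms and workloads.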
Abstract
RDMA over Converged Ethernet (RoCE) has gained significant traction for datacenter networks due to its compatibility with conventional Ethernet-based fabric. However, the …