BriefGPT.xyz
May, 2023
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
Sahil Tyagi, Prateek Sharma
TL;DR
This paper proposes a dynamic data-parallel training technique that, using ideas from proportional control and PID controllers, equalizes iteration times across a heterogeneous compute cluster by adjusting mini-batch sizes, thereby reducing model training time.
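The controller idea above can be sketched as follows. This is a minimal, hypothetical illustration of proportional-integral batch-size control, not the paper's implementation: each worker compares its iteration time against a cluster-wide target and scales its local mini-batch accordingly. All names, gains, and the update rule are illustrative assumptions.

```python
# Hypothetical sketch: PID-style (here, PI) control of per-worker batch size
# so that iteration times equalize across a heterogeneous cluster.
# Gains and structure are illustrative, not taken from the paper.

class BatchSizeController:
    def __init__(self, batch_size, kp=0.5, ki=0.1):
        self.batch_size = batch_size
        self.kp = kp          # proportional gain (assumed value)
        self.ki = ki          # integral gain (assumed value)
        self.integral = 0.0   # accumulated error

    def update(self, iter_time, target_time):
        # Error: relative deviation of this worker's iteration time
        # from the cluster-wide target (positive when the worker is fast).
        error = (target_time - iter_time) / target_time
        self.integral += error
        # Proportional + integral correction, applied multiplicatively
        # so the batch size scales smoothly toward equal iteration times.
        adjust = self.kp * error + self.ki * self.integral
        self.batch_size = max(1, round(self.batch_size * (1 + adjust)))
        return self.batch_size

# A slow worker (iteration time above target) gets a smaller batch;
# a fast worker gets a larger one.
slow = BatchSizeController(batch_size=128)
print(slow.update(iter_time=1.2, target_time=1.0))   # shrinks below 128
fast = BatchSizeController(batch_size=128)
print(fast.update(iter_time=0.8, target_time=1.0))   # grows above 128
```

In a data-parallel setting, such a controller would run per worker between iterations; keeping the global batch size roughly constant while redistributing it across workers is one plausible design choice, though the paper's exact policy may differ.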
Abstract
Current techniques and systems for distributed model training mostly assume that clusters are comprised of homogeneous servers with a constant resource availability. However, cluster heterogeneity is pervasive in …