Chan Wu, Hanxiao Zhang, Lin Ju, Jinjing Huang, Youshao Xiao...
TL;DR: large-scale model training, delay equalization, partially redundant optimizer, hierarchical overlapping ring, training efficiency
Abstract
As model sizes and training datasets continue to increase, large-scale model
training frameworks reduce memory consumption by various sharding techniques.
However, the substantial communication overhead degrades training efficiency,
especially in public cloud environments with varying netw