The cloud remains a popular platform for distributed deep learning (DL)
training jobs because resource sharing in the cloud can improve resource
utilization and reduce overall costs. However, such sharing also brings
multiple challenges for DL training jobs, e.g., high-priority jobs cou