Training time on large datasets for deep neural networks is the principal
workflow bottleneck in a number of important applications of deep learning,
such as object classification and detection in automatic driver assistance
systems (ADAS). To minimize training time, the training of a deep neural
network must be scaled beyond a single machine to as many mach