This paper focuses on over-parameterized deep neural networks (DNNs) with ReLU activation functions and proves that, when the data distribution is well separated, DNNs can achieve Bayes-optimal test error for classification while attaining (nearly) zero training error under the lazy training regime.
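
To make the "lazy training" setting concrete, below is a minimal illustrative sketch, not the paper's construction: a wide two-layer ReLU network trained on well-separated Gaussian data typically interpolates the training set (near-zero training error) while its weights barely move from initialization, which is the hallmark of the lazy regime. All names and hyperparameters here (e.g., `WIDTH`, the cluster separation) are assumptions for illustration.

```python
# Hypothetical sketch of the lazy-training regime: an over-parameterized
# two-layer ReLU net fit on well-separated data. Not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two well-separated Gaussian clusters (binary classification, labels in {-1, +1}).
n, d, WIDTH = 200, 2, 4096  # WIDTH >> n: over-parameterized
x = torch.cat([torch.randn(n, d) + 4.0, torch.randn(n, d) - 4.0])
y = torch.cat([torch.ones(n), -torch.ones(n)])

# Wide two-layer ReLU network.
model = nn.Sequential(nn.Linear(d, WIDTH), nn.ReLU(), nn.Linear(WIDTH, 1))
w0 = torch.cat([p.detach().flatten().clone() for p in model.parameters()])

opt = torch.optim.SGD(model.parameters(), lr=0.05)
for _ in range(500):
    opt.zero_grad()
    # Logistic-type loss for +/-1 labels.
    loss = torch.nn.functional.soft_margin_loss(model(x).squeeze(), y)
    loss.backward()
    opt.step()

w1 = torch.cat([p.detach().flatten() for p in model.parameters()])
train_err = (torch.sign(model(x).squeeze()) != y).float().mean()
print(f"training error: {train_err:.3f}")  # near zero: the net interpolates
# Small relative movement from initialization is the signature of lazy training.
print(f"relative weight movement: {(w1 - w0).norm() / w0.norm():.4f}")
```

In this regime the network behaves approximately like its linearization around initialization (the neural-tangent-kernel view), which is what makes the simultaneous interpolation and good generalization amenable to analysis.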