Neural networks with ReLU activations have achieved great empirical success across a variety of domains. However, existing results for learning ReLU networks either impose assumptions on the underlying data distribution (e.g., that it is Gaussian), or require the network size and/or training size to be s