Modern deep networks are trained with stochastic gradient descent (SGD) whose key parameters are the number of data considered at each step or batch size $B$, and the step size or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the `temperature' $T\equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B\geq B^*$, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here we resolve these questions for a teacher-student perceptron classification model, and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: $\textit{(i)}$ a noise-dominated SGD governed by temperature, $\textit{(ii)}$ a large-first-step-dominated SGD and $\textit{(iii)}$ GD. These different phases also corresponds to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes $\textit{(i)}$ and $\textit{(ii)}$ scale with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.

通过对教师-学生感知器分类模型的研究，我们在B-η平面上获得了一个相图，分为三个动力学相：(i)由温度控制的噪声主导的SGD，(ii)由大步长主导的SGD和(iii)GD，这些不同相还对应着不同的泛化误差区域。有趣的是，我们的分析揭示了将相(i)和相(ii)分隔开的批次大小B*与训练集大小P呈比例，其中的指数表征了分类问题的难度。

随机梯度下降的不同制度