We consider the gradient descent flow widely used to minimize the
$\mathcal{L}^2$ cost function in deep learning networks, and introduce two
modified versions: one adapted to the overparametrized setting, and the other
to the underparametrized setting. Both have a clear and natural invariant
geometric meaning: the first takes into account the pullback vector bundle
structure in the overparametrized setting, and the second the pushforward
vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that,
provided that a rank condition holds, all orbits of the modified gradient
descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform
exponential convergence rate. We point out relations of the latter to
sub-Riemannian geometry.
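
The exponential convergence claim can be illustrated numerically. The sketch below is a toy example under stated assumptions, not the construction from the paper: the abstract does not give the formula for the modified flow, so we assume the form $\dot\theta = -D^T (D D^T)^{-1} (x - y)$, where $D$ is the Jacobian of the network output $x$ with respect to the parameters $\theta$, and the rank condition is read as invertibility of $D D^T$. Under that assumption the output obeys $\dot x = -(x - y)$, so the cost $\frac{1}{2}|x - y|^2$ decays like $e^{-2s}$. The model, the names `model`, `jacobian`, `modified_step`, `run`, and the step size are all hypothetical choices made for illustration.

```python
import math


def model(theta):
    """Toy map from 3 parameters to 2 outputs (overparametrized: 3 > 2)."""
    t0, t1, t2 = theta
    return [t0 + math.sin(t2), t1 + math.cos(t2)]


def jacobian(theta):
    """2x3 Jacobian D of the toy model.

    Its first two columns form the identity, so D has rank 2 everywhere
    and the assumed rank condition holds along every orbit.
    """
    t2 = theta[2]
    return [[1.0, 0.0, math.cos(t2)],
            [0.0, 1.0, -math.sin(t2)]]


def cost(theta, y):
    """L^2 cost 0.5 * |x(theta) - y|^2 for a single target y."""
    x = model(theta)
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def modified_step(theta, y, h):
    """One explicit Euler step of the ASSUMED flow
    theta' = -D^T (D D^T)^{-1} (x - y)."""
    x = model(theta)
    r = [x[0] - y[0], x[1] - y[1]]
    D = jacobian(theta)
    # Gram matrix G = D D^T (symmetric 2x2).
    g00 = sum(D[0][k] * D[0][k] for k in range(3))
    g01 = sum(D[0][k] * D[1][k] for k in range(3))
    g11 = sum(D[1][k] * D[1][k] for k in range(3))
    det = g00 * g11 - g01 * g01
    # Solve G w = r with the closed-form 2x2 inverse.
    w0 = (g11 * r[0] - g01 * r[1]) / det
    w1 = (-g01 * r[0] + g00 * r[1]) / det
    # Parameter update along D^T w.
    return [theta[k] - h * (D[0][k] * w0 + D[1][k] * w1) for k in range(3)]


def run(theta0, y, h=0.1, steps=200):
    """Iterate the discretized flow and record the cost after every step."""
    theta = list(theta0)
    costs = [cost(theta, y)]
    for _ in range(steps):
        theta = modified_step(theta, y, h)
        costs.append(cost(theta, y))
    return theta, costs


if __name__ == "__main__":
    theta, costs = run([0.5, -1.0, 2.0], [1.0, 0.3])
    print("initial cost:", costs[0])
    print("final cost:  ", costs[-1])
    print("last per-step cost ratio:", costs[-1] / costs[-2])
```

In this discretization the residual $x - y$ shrinks by a factor of roughly $1 - h$ per step once it is small, so the per-step cost ratio approaches $(1 - h)^2$ regardless of the starting point. That is the discrete counterpart of a uniform exponential rate. Plain gradient descent on $\theta$ would instead have a rate that depends on the spectrum of $D D^T$ along the orbit.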
