The regularization and output consistency offered by dropout and layer-wise pretraining for learning deep networks have been well studied. However, our understanding of the explicit convergence of parameter estimates, and of their dependence on structural aspects (such as depth and layer lengths) and learning aspects (such as denoising and dropout rate), is less mature. An interesting question is whether the network architecture and input statistics could ``guide'' the choices of such learning parameters, and vice versa. In this work, we explore these gaps between the structural, distributional and learnability aspects of deep networks, and their interaction with parameter convergence rates. We present a way to address these issues based on the convergence of backpropagation for general nonconvex objectives using first-order information. Within this framework, we show an interesting relationship between feature denoising and dropout, and subsequently derive the convergence rates of multi-layer networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.

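This work studies the interplay among backpropagation, deep networks, dropout, parameter convergence, and feature denoising, proposes a framework for analyzing the convergence of backpropagation based on the objective function, and verifies its correctness through experiments.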

The Interplay Between Network Structure and Gradient Convergence in Deep Learning