A core component of many successful neural network architectures is an MLP block: two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where some neurons or weights can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to obtain more efficient networks. Motivated by this, we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks will lead to methods that can exploit activation sparsity in practice.

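A core component of many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. In this paper we carry out a formal study of the PAC learnability of MLP layers that exhibit activation sparsity, and present a variety of results showing that such classes of functions have computational and statistical advantages over their non-sparse counterparts. We hope that a better theoretical understanding of {\em sparsely activated} networks will make it possible to exploit activation sparsity in practice.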

Learning Neural Networks with Sparse Activations