Attention modules, as simple and effective tools, have not only enabled deep neural networks to achieve state-of-the-art results in many domains, but also enhanced their interpretability. Most current models use deterministic attention modules due to their simplicity and ease of optimization. Stochastic counterparts, on the other hand, are less popular despite their potential benefits. The main reason is that stochastic attention often introduces optimization issues or requires significant model changes. In this paper, we propose a scalable stochastic version of attention that is easy to implement and optimize. We construct simplex-constrained attention distributions by normalizing reparameterizable distributions, making the training process differentiable. We learn their parameters in a Bayesian framework where a data-dependent prior is introduced for regularization. We apply the proposed stochastic attention modules to various attention-based models, with applications to graph node classification, visual question answering, image captioning, machine translation, and language understanding. Our experiments show the proposed method brings consistent improvements over the corresponding baselines.

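This work proposes a scalable stochastic version of attention that is easy to implement and optimize. It constructs simplex-constrained attention distributions by normalizing reparameterizable distributions, and learns their parameters in a Bayesian framework with a data-dependent prior for regularization. Applied to various attention-based models, the method yields consistent improvements on graph node classification, visual question answering, image captioning, machine translation, and language understanding.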
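
To make the idea of normalizing reparameterizable distributions concrete, the following is a minimal sketch and not the authors' implementation. It assumes a log-normal distribution as one possible reparameterizable choice; the function names, the fixed `sigma` parameter, and the use of plain Python lists are illustrative assumptions. The data-dependent prior and its regularization term are omitted.

```python
import math
import random


def softmax(scores):
    """Deterministic attention: exponentiate scores and normalize to the simplex."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_attention(scores, sigma=0.5, rng=random):
    """Sample attention weights by normalizing log-normal draws (assumed choice).

    Each unnormalized weight is exp(score + sigma * eps) with eps ~ N(0, 1).
    The noise eps is drawn independently of the scores, so in an autodiff
    framework gradients could flow through `scores` and `sigma`.
    """
    m = max(scores)  # a shared shift cancels out after normalization
    eps = [rng.gauss(0.0, 1.0) for _ in scores]
    unnorm = [math.exp(s - m + sigma * e) for s, e in zip(scores, eps)]
    total = sum(unnorm)
    # Dividing by the sum places the positive samples on the simplex.
    return [u / total for u in unnorm]


if __name__ == "__main__":
    scores = [1.0, 2.0, 0.5]
    print("deterministic:", softmax(scores))
    print("stochastic:   ", stochastic_attention(scores, sigma=0.5))
```

With `sigma=0` the sampled weights reduce to the deterministic softmax weights, which illustrates how a stochastic module of this kind could replace a deterministic one without changing the rest of the model.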

Bayesian Attention Modules