Recurrent neural networks (RNNs) have for years represented the state of the
art in neural machine translation. Recently, new architectures have been
proposed that leverage parallel computation on GPUs better than classical
RNNs. Faster training and inference, combined with different
sequence-to-sequence modeling, have also led to performance improvements. While the
new models completely depart from the original recurrent architecture, we
decided to investigate how to make RNNs more efficient. In this work, we
propose a new recurrent NMT architecture, called Simple Recurrent NMT, built on
a class of fast and weakly-recurrent units that use layer normalization and
multiple attentions. Our experiments on the WMT14 English-to-German and WMT16
English-Romanian benchmarks show that our model represents a valid alternative
to LSTMs, as it can achieve better results at a significantly lower
computational cost.

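This paper proposes a new recurrent NMT architecture, called Simple Recurrent NMT, built on a class of fast and weakly-recurrent units that use layer normalization and multiple attentions. Experiments on the WMT14 English-to-German and WMT16 English-Romanian benchmarks show that the model is a valid alternative to LSTMs, achieving better results at a significantly lower computational cost.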

Deep Neural Machine Translation with Weakly-Recurrent Units

Layer normalization is a recently introduced technique for normalizing the
activities of neurons in deep neural networks to improve the training speed and
stability. In this paper, we introduce a new layer normalization technique
called Dynamic Layer Normalization (DLN) for adaptive neural acoustic modeling
in speech recognition. By dynamically generating the scaling and shifting
parameters in layer normalization, DLN adapts neural acoustic models to the
acoustic variability arising from various factors such as speakers, channel
noises, and environments. Unlike other adaptive acoustic models, our proposed
approach does not require additional adaptation data or speaker information
such as i-vectors. Moreover, the model size is fixed as it dynamically
generates adaptation parameters. We apply our proposed DLN to deep
bidirectional LSTM acoustic models and evaluate them on two benchmark datasets
for large vocabulary ASR experiments: WSJ and TED-LIUM release 2. The
experimental results show that our DLN improves neural acoustic models in terms
of transcription accuracy by dynamically adapting to various speakers and
environments.
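The idea of generating the scaling and shifting parameters dynamically, as described in the abstract above, can be illustrated with a minimal NumPy sketch. The abstract does not specify how the parameters are produced, so the summarizer used here (mean-pooling a tanh projection over time) and all names (`dynamic_layer_norm`, `W_s`, `W_g`, `W_b`, and the matching biases) are hypothetical choices for illustration, not the authors' formulation.

```python
import numpy as np

def dynamic_layer_norm(h, params, eps=1e-5):
    """Layer-normalize h with gain and bias generated from h itself.

    h: array of shape (time, features), the summed inputs for one utterance.
    params: dict of hypothetical weights (assumed, not from the paper):
        W_s (features, summary), b_s (summary,)   -> utterance summary
        W_g (summary, features), b_g (features,)  -> gain generator
        W_b (summary, features), b_b (features,)  -> bias generator
    """
    # Assumed summarizer: mean-pool a nonlinear projection over time.
    summary = np.tanh(h @ params["W_s"] + params["b_s"]).mean(axis=0)

    # Scaling and shifting parameters depend on the utterance summary,
    # so no extra adaptation data or speaker information is needed.
    gain = summary @ params["W_g"] + params["b_g"]
    bias = summary @ params["W_b"] + params["b_b"]

    # Standard layer normalization over the feature axis at each time step.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gain * (h - mean) / np.sqrt(var + eps) + bias
```

In this sketch the number of parameters is fixed regardless of how many speakers or environments are seen, which matches the abstract's point that the model size does not grow with adaptation.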

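This paper proposes a new technique, Dynamic Layer Normalization (DLN), for adaptive neural acoustic modeling. It requires no additional adaptation data or speaker information, adapts to varying speakers and environments, and is shown to improve transcription accuracy in large vocabulary ASR experiments.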

Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition

Normalization techniques have only recently begun to be exploited in
supervised learning tasks. Batch normalization exploits mini-batch statistics
to normalize the activations. This was shown to speed up training and result in
better models. However, its success has been very limited when dealing with
recurrent neural networks. On the other hand, layer normalization normalizes
the activations across all activities within a layer. This was shown to work
well in the recurrent setting. In this paper, we propose a unified view of
normalization techniques, as forms of divisive normalization, which includes
layer and batch normalization as special cases. Our second contribution is the
finding that a small modification to these normalization schemes, in
conjunction with a sparse regularizer on the activations, leads to significant
benefits over standard normalization techniques. We demonstrate the
effectiveness of our unified divisive normalization framework in the context of
convolutional neural nets and recurrent neural networks, showing improvements
over baselines in image classification, language modeling, and
super-resolution.
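One way to read the claim that batch and layer normalization are special cases of a single scheme is that they differ only in the set of activations over which statistics are computed. The NumPy sketch below shows this under stated assumptions: the function names are hypothetical, the abstract does not give the paper's exact formulation or its "small modification", and the L1 penalty is only one plausible form of a sparse regularizer on the activations.

```python
import numpy as np

def normalize_over(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_view(x):
    # x has shape (batch, features): statistics per feature, across the batch.
    return normalize_over(x, axes=0)

def layer_norm_view(x):
    # Statistics per training case, across all features in the layer.
    return normalize_over(x, axes=1)

def l1_activation_penalty(x, weight=1e-3):
    # Assumed sparse regularizer: mean absolute activation, scaled by `weight`.
    return weight * np.abs(x).mean()
```

The penalty would be added to the training loss; the `weight` value here is an arbitrary placeholder.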

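This paper proposes a unified view of normalization techniques as forms of divisive normalization, which includes batch and layer normalization as special cases, and finds that combining these schemes with a sparse regularizer on the activations improves convolutional and recurrent neural networks.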

Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Training state-of-the-art, deep neural networks is computationally expensive.
One way to reduce the training time is to normalize the activities of the
neurons. A recently introduced technique called batch normalization uses the
distribution of the summed input to a neuron over a mini-batch of training
cases to compute a mean and variance which are then used to normalize the
summed input to that neuron on each training case. This significantly reduces
the training time in feed-forward neural networks. However, the effect of batch
normalization is dependent on the mini-batch size and it is not obvious how to
apply it to recurrent neural networks. In this paper, we transpose batch
normalization into layer normalization by computing the mean and variance used
for normalization from all of the summed inputs to the neurons in a layer on a
single training case. Like batch normalization, we also give each neuron its
own adaptive bias and gain which are applied after the normalization but before
the non-linearity. Unlike batch normalization, layer normalization performs
exactly the same computation at training and test times. It is also
straightforward to apply to recurrent neural networks by computing the
normalization statistics separately at each time step. Layer normalization is
very effective at stabilizing the hidden state dynamics in recurrent networks.
Empirically, we show that layer normalization can substantially reduce the
training time compared with previously published techniques.
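The difference between the two schemes described in the abstract above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names, the `eps` constant, and the `(batch, features)` layout are assumptions, and the batch version shows only training-time behavior (no running averages).

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize summed inputs `a` of shape (batch, features).

    Mean and variance are computed over all neurons in the layer,
    separately for each training case (each row).
    """
    mean = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    # Per-neuron gain and bias are applied after normalization;
    # the non-linearity would follow this call.
    return gain * (a - mean) / np.sqrt(var + eps) + bias

def batch_norm_train(a, gain, bias, eps=1e-5):
    """Training-time batch normalization, for comparison.

    Mean and variance are computed per neuron over the mini-batch,
    so the result depends on which other cases are in the batch.
    """
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return gain * (a - mean) / np.sqrt(var + eps) + bias
```

Because `layer_norm` uses only the statistics of a single case, it gives the same output for that case at training and test time and for any mini-batch size. In a recurrent network it would be called on the summed inputs at each time step.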

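This paper proposes layer normalization, a method for training deep neural networks that effectively stabilizes the hidden state dynamics in recurrent networks and substantially reduces training time compared with previously published techniques.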