Knowing which words have been attended to in previous time steps while
generating a translation is a rich source of information for predicting
which words will be attended to in the future. We improve upon the attention model of
Bahdanau et al. (2014) by explicitly modeling the relation