Paper Summary: Neural Machine Translation by Jointly Learning to Align and Translate

Last updated:

Please note This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


Authors introduce attention, which is a way to enhance encoder-decoder architectures for sequence-to-sequence (seq2seq) learning. This is applied to Neural Machine Translation (NMT) as an example.

Attention mechanisms are a way to force models to select (attend to) some parts of the sequence input, rather than the whole input.


Because the traditional encoder-decoder architecture is limited by a bottleneck where information passes from input to output.

This happens because the input sequence is compressed to a single fixed-width vector and then it's converted into the output sequence.

Twitter Linkedin YC Hacker News Reddit


In the new proposed architecture, the input sequence is first projected into multiple vectors and the attention mechanism learns to combine/choose those to produce the output sequence.

In practice, what happens is that there is an individual fixed-width representation (a context) for each input element1.

These element-specific contexts are learned jointly with the seq2seq task and they are built off two components:

  • 1) information about other elements surrounding element \(i\) (called annotations on each element \(i\))

  • 2) information about how strong each element should impact the output token (i.e. weights)2


  • Proposed method outperforms vanilla RNN-based encoder-decoder

  • Proposed method outperforms traditional phrase-based systems even though they are only evaluated in-vocabulary and use more data

  • The performance of vanilla RNN-based encode-decoders drops dramatically with longer sequences, while that of the proposed method shows no such deterioration.

Twitter Linkedin YC Hacker News Reddit


  • In the proposed model, the architecture is still an encoder-decoder, but what's in the middle is not a single vector but multiple ones.

  • By aligning, the authors mean matching words across different languages that be be in different order, such as the french word Zone that has been matched with Area, in order to translate European Economic Area into Zone Économique Européen.

  • Visualizing the alignment weights for each word is very useful in the case of translating

1: In traditional encoder-decoders, all input elements are squashed into a single fixed-width vector.

2: The weights are also learned via an embedded, small neural net referred to as a alignment model.


Dialogue & Discussion