Paper Summary: Neural Machine Translation by Jointly Learning to Align and Translate



Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


The authors introduce attention, a way to enhance encoder-decoder architectures for sequence-to-sequence (seq2seq) learning. They apply it to Neural Machine Translation (NMT) as an example.

Attention mechanisms are a way to let models focus on (attend to) the most relevant parts of the input sequence for each output, rather than treating the whole input uniformly.


The traditional encoder-decoder architecture is limited by a bottleneck through which all information must pass on its way from input to output.

This happens because the input sequence is compressed into a single fixed-width vector, which is then converted into the output sequence.


In the proposed architecture, the input sequence is first projected into multiple vectors, and the attention mechanism learns to combine/choose from those to produce the output sequence.
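To make the contrast concrete, here is a minimal sketch of the two encoding styles. It assumes a toy scalar RNN with arbitrary, unlearned weights (`w_x`, `w_h` and all function names are illustrative, not from the paper, which uses a bidirectional RNN with vector-valued states).

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.9):
    """One step of a toy scalar RNN (weights are arbitrary, not learned)."""
    return math.tanh(w_x * x + w_h * h)

def encode_fixed(inputs):
    """Vanilla encoder: keep only the final hidden state."""
    h = 0.0
    for x in inputs:
        h = rnn_step(x, h)
    return [h]  # one fixed-width summary, whatever the input length

def encode_sequence(inputs):
    """Attention-style encoder: keep one hidden state per input element."""
    h, states = 0.0, []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

short_seq = [1.0, -1.0]
long_seq = [1.0, -1.0, 0.5, 0.2, -0.3, 0.8]

print(len(encode_fixed(short_seq)), len(encode_fixed(long_seq)))        # 1 1
print(len(encode_sequence(short_seq)), len(encode_sequence(long_seq)))  # 2 6
```

The fixed encoder's output has the same size no matter how long the input is, which is the bottleneck; the sequence encoder's output grows with the input, leaving the decoder something to choose from.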

In practice, the decoder builds an individual fixed-width representation (a context) for each output element¹.

These element-specific contexts are learned jointly with the seq2seq task and they are built from two components:

  • 1) information about each input element \(j\) and the elements surrounding it (called the annotation of element \(j\))

  • 2) information about how strongly each input element should impact the output token (i.e. weights)²
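The way the two components combine can be sketched in a few lines of plain Python. This follows the additive scoring form described in the paper, \(e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)\), with the weights obtained by a softmax over the scores and the context as the weighted sum of annotations. The toy values for `annotations`, `s_prev`, `W_a`, `U_a` and `v_a` are made up for illustration, and in a real model the parameters would be learned.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def alignment_score(s_prev, h_j, W_a, U_a, v_a):
    """Additive score: v_a . tanh(W_a s_prev + U_a h_j)."""
    pre = [a + b for a, b in zip(matvec(W_a, s_prev), matvec(U_a, h_j))]
    return dot(v_a, [math.tanh(p) for p in pre])

def context_vector(s_prev, annotations, W_a, U_a, v_a):
    """Weighted sum of annotations for one decoding step."""
    scores = [alignment_score(s_prev, h, W_a, U_a, v_a) for h in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]
    return context, weights

# Toy, hand-written values (assumptions for illustration only).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one per input element
s_prev = [0.5, -0.5]                                # previous decoder state
W_a = [[0.1, 0.2], [0.3, 0.4]]
U_a = [[0.5, -0.5], [0.2, 0.1]]
v_a = [1.0, -1.0]

context, weights = context_vector(s_prev, annotations, W_a, U_a, v_a)
print("weights:", weights)
print("context:", context)
```

Because the weights come out of a softmax, the context is always a convex combination of the annotations, and a new context is computed at every decoding step.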


  • Proposed method outperforms vanilla RNN-based encoder-decoder

  • Proposed method performs on par with a traditional phrase-based system (Moses) when evaluated only on in-vocabulary sentences, even though the phrase-based system uses additional monolingual data

  • The performance of vanilla RNN-based encoder-decoders drops dramatically with longer sequences, while that of the proposed method shows no such deterioration.


  • Summary of attention mechanisms: "The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation."


  • In the proposed model, the architecture is still an encoder-decoder, but what's in the middle is not a single vector but multiple ones.

  • By aligning, the authors mean matching words across languages even when they appear in a different order. For example, when translating European Economic Area into Zone Économique Européenne, the French word Zone is matched with Area.

    • Here the word Zone is correctly matched (aligned) with Area, even though Area comes last in the English phrase and Zone comes first in the French one.
  • Visualizing the alignment weights for each word is very useful in translation, since it shows which source words the model attended to for each output word.

  • The attention mechanism (alignment model) is only used in the decoder model.
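As a rough illustration of how alignment weights can be read, the sketch below picks, for each target word, the source word with the largest weight. The numbers in `alignment` are hypothetical and hand-picked for this example (they are not results from the paper), and `hard_alignment` is just an illustrative helper name.

```python
source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]

# Hypothetical weights, hand-picked for illustration (NOT taken from the paper).
# Row i holds the attention weights of target word i over the source words.
alignment = [
    [0.05, 0.10, 0.85],  # zone
    [0.10, 0.80, 0.10],  # économique
    [0.85, 0.10, 0.05],  # européenne
]

def hard_alignment(weights, source, target):
    """For each target word, return the source word with the largest weight."""
    pairs = []
    for tgt, row in zip(target, weights):
        best = max(range(len(row)), key=row.__getitem__)
        pairs.append((tgt, source[best]))
    return pairs

for tgt, src in hard_alignment(alignment, source, target):
    print(f"{tgt} -> {src}")
# zone -> Area
# économique -> Economic
# européenne -> European
```

With these made-up weights the large entries run along the anti-diagonal, which is the kind of pattern one would expect when word order is reversed between the two languages.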

1: In traditional encoder-decoders, all input elements are squashed into a single fixed-width vector.

2: The weights are also learned, via a small embedded neural net referred to as an alignment model.

