Paper Summary: Attention Is All You Need


Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


Attention refers to adding a learned mask vector to a neural network model.

This mask is applied to features so that some features get amplified (attended to) while others are dampened, depending upon the task/example.

The seminal article for attention mechanisms seems to be Bahdanau et al. 2014 (summary here).


Up until this article, attention mechanisms were used in addition to traditional recurrent encoder-decoder Seq2Seq architectures.

This article introduces the transformer architecture, which uses only attention, dropping recurrent connections altogether.

The specific variant of attention used is self-attention or, even more specifically, scaled dot-product attention.
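To make "scaled dot-product attention" concrete, here is a minimal NumPy sketch of \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V\). The function names and toy shapes are my own, and for simplicity the queries, keys and values are all the raw input `X`; a real model would first pass `X` through learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating, for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the attended values (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    # Compatibility score between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row of weights is a distribution over the key positions.
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Self-attention on a toy sequence: 4 positions, 8 features each (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(X, X, X)
print(out.shape, weights.shape)
```

Each output row is a weighted average of the rows of `V`, with the weights decided by how well that position's query matches every key.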


Recurrent connections (as in RNNs and LSTMs) are inherently hard to parallelize, which precludes the use of larger datasets and increases training and inference time and cost.


  • Multi-head Attention: All attention layers have multiple attention heads. This means that instead of producing a single (averaged) context vector \(c_i\) for each position \(i\) of the input sequence, multiple context vectors are produced in parallel, one per head. They are then concatenated and combined through a linear projection.

    • This makes the attention layers more expressive than when using the standard strategy (single-head).
  • Self-attention: This is how a single representation for sequences is produced without the use of recurrent connections. Instead of these more costly layers, (multi-head) attention between different parts of the sequence itself is used to produce a single representation for the sequence.

    • This is where the title of the article comes from.
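The two ideas above combine into multi-head self-attention, which can be sketched in NumPy as follows. This is an illustration under my own assumptions, not the paper's code: the weight matrices are random stand-ins for learned parameters, biases are omitted, and names such as `multi_head_self_attention` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Softmax over the last axis, stabilised by subtracting the max.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).

    d_model must be divisible by n_heads.
    """
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    # Queries, keys and values all come from the same sequence X (self-attention).
    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently for every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    heads = softmax(scores) @ V                          # (n_heads, n, d_head)

    # Concatenate the heads back to (n, d_model) and mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

# Toy usage: 5 positions, d_model = 16, 4 heads (all sizes are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads=4)
print(out.shape)
```

Every position ends up with a new vector of the same size as its input, built from all other positions of the same sequence, with no recurrence involved.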


  • Self-attention layers are faster than recurrent layers when the sequence length is smaller than the representation dimensionality.

  • Self-attention could yield more interpretable models than recurrent and/or convolutional nets.

  • The presented model establishes a new SOTA score for English-German and English-French translation, at a fraction of the training cost of previous models.

  • The presented model also produced near-SOTA results in an unrelated task (see footnote 1).
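The speed claim above (self-attention vs. recurrent layers) follows from the per-layer complexities given in the paper: roughly \(O(n^2 \cdot d)\) for self-attention and \(O(n \cdot d^2)\) for a recurrent layer, where \(n\) is the sequence length and \(d\) the representation dimensionality. A back-of-the-envelope sketch, ignoring constant factors (the function names and the example sentence length are my own):

```python
def self_attention_ops(n, d):
    # Every position attends to every other position: n * n pairs, d work each.
    return n * n * d

def recurrent_ops(n, d):
    # One d x d state update per position, performed n times in sequence.
    return n * d * d

# Assumed example: a 50-token sentence with d = 512 (the base model's d_model).
n, d = 50, 512
print(self_attention_ops(n, d))  # 1280000
print(recurrent_ops(n, d))       # 13107200
```

The ratio between the two is simply \(d / n\), so self-attention wins on this count whenever \(n < d\). On top of that, the recurrent layer needs \(n\) sequential steps, while self-attention handles all positions at once.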


  • About RNNs and LSTMs: "This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples."


  • The Transformer is not a replacement for the Encoder-Decoder architecture; it is rather a replacement for RNNs/LSTMs in such architectures.

    • I.e. the Transformer still uses an Encoder-Decoder architecture
  • Each individual attention "cell" is a slightly modified version of the attention mechanism by Bahdanau et al. 2014. It scores each query-key pair with a dot product instead of an additive (feed-forward) function, and scales the scores by \(1/\sqrt{d_k}\) so that large values do not saturate the softmax.

  • Masking is used in the decoder's attention layers to block illegal connections (i.e. using information not available at inference time, such as future positions).

  • Sinusoidal values, called positional encodings, are added to each input embedding to allow the model to infer their order (because the values from the previous time step are not used, as there is no recurrence).
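The last two notes (masking and positional encodings) can be sketched in a few lines of NumPy. This is a rough illustration, assuming a causal "no looking ahead" mask and the sine/cosine formulation from the paper; the helper names are hypothetical and `d_model` is assumed to be even.

```python
import numpy as np

def causal_mask(n):
    # True where position i is allowed to attend to position j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions get a very negative score, so their weight becomes ~0.
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the even indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: uniform scores for 3 positions, then encodings for 5 positions.
weights = masked_softmax(np.zeros((3, 3)), causal_mask(3))
print(weights)  # row i spreads its weight evenly over positions 0..i only

embeddings = np.zeros((5, 8))  # stand-in for learned token embeddings
inputs = embeddings + positional_encoding(5, 8)
print(inputs.shape)
```

Because the encodings are simply added to the embeddings, two identical tokens at different positions reach the attention layers as different vectors.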

MY 2¢

Amazing to see such a large number of engineering and theoretical improvements, picked up from multiple sources and added to original developments (like multi-head attention), packed into a single implementation.

1: Constituency parsing refers to breaking a sentence down into nested sub-phrases, or constituents (e.g. noun phrases and verb phrases), producing what's called a parse tree.