Paper Summary: Attention Is All You Need

Last updated: 11 Feb 2023

Table of Contents

ATTENTION
WHAT
WHY
HOW
CLAIMS
QUOTES
NOTES
MY 2¢

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

ATTENTION

Attention refers to adding a learned mask vector to a neural network model.

This mask is applied to features so that some features get amplified (attended to) while others are dampened, depending upon the task/example.

The seminal article for attention mechanisms seems to be Bahdanau et al. 2014 (summary here).

WHAT

Up until this article, attention mechanisms were used in addition to traditional recurrent encoder-decoder Seq2Seq architectures.

This article introduces the transformer architecture, which uses only attention, dropping recurrent connections altogether.

The specific variant of attention used is self-attention, computed with scaled dot-product attention.
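To make this concrete, here is a minimal sketch of scaled dot-product attention, following the formula given in the paper, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. It uses plain Python lists and only the standard library; the function names and the toy input `X` are my own, and the learned projections that produce the queries, keys and values in the real model are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of lists). Returns n x d_v."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # non-negative, sums to 1
        # The result is a weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output

# Self-attention: queries, keys and values all come from the same sequence.
# (Toy input; in the real model they are learned linear projections of X.)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scaled_dot_product_attention(X, X, X)
```

Each row of `out` is a mixture of the rows of `X`, weighted by how similar the corresponding position is to every other position.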

WHY

Because recurrent connections (as in RNNs and LSTMs) are naturally hard to parallelize, precluding the use of larger datasets and increasing training and inference time and cost.

HOW

Multi-head Attention: All attention layers have multiple attention heads. This means that instead of producing a single (averaged) context vector $c_i$ for each position $i$ of the input sequence, multiple context vectors are produced in parallel, one per head. They are then concatenated and combined by a linear projection.
- This makes the attention layers more expressive than when using the standard strategy (single-head).
Self-attention: This is how a single representation for sequences is produced without the use of recurrent connections. Instead of these more costly layers, (multi-head) attention between different parts of the sequence itself is used to produce a single representation for the sequence.
- This is where the title of the article comes from
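The two ideas above can be combined in a rough sketch of multi-head self-attention. This is an illustration under my own assumptions, not the reference implementation: `random_matrix` stands in for learned weights, the sizes are tiny (the paper uses 8 heads with a model dimension of 512), and biases, dropout and residual connections are omitted.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (hypothetical initialisation)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        # Each head attends over its own projection of the same sequence X.
        head_outputs.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads' outputs along the feature axis...
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # ...and combine them with a final linear projection.
    return matmul(concat, W_o)

rng = random.Random(0)
n, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_self_attention(X, heads, W_o)  # one d_model-sized vector per position
```

Note that no step depends on the output for the previous position, which is what makes the computation easy to parallelize across the sequence.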

CLAIMS

Self-attention layers are faster than recurrent layers when the sequence length is smaller than the representation dimensionality.
Self-attention yields more interpretable models than recurrent and/or convolutional nets.
The presented model establishes a new SOTA score for English-German and English-French translation, at a fraction of the training cost of previous models.
The presented model also produced near-SOTA results in an unrelated task¹
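The speed claim above can be read off the per-layer complexity figures the paper reports (its Table 1), where $n$ is the sequence length and $d$ is the representation dimensionality:

$$
\underbrace{O(n^2 \cdot d)}_{\text{self-attention}}
\quad \text{vs.} \quad
\underbrace{O(n \cdot d^2)}_{\text{recurrent}},
\qquad
n^2 d < n d^2 \iff n < d
$$

On top of that, a self-attention layer needs $O(1)$ sequential operations, while a recurrent layer needs $O(n)$, since each step must wait for the previous one.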

QUOTES

About RNNs and LSTMs: "This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples."

NOTES

The Transformer is not a replacement for the Encoder-Decoder architecture; it is rather a replacement for RNNs/LSTMs in such architectures.
- I.e. the Transformer still uses an Encoder-Decoder architecture
Each individual attention "cell" is a slightly modified version of the attention mechanism by Bahdanau et al. 2014. It scores query-key pairs with a dot product instead of an additive (feed-forward) function, and scales the dot products by $1/\sqrt{d_k}$ before the softmax to avoid large values.
Masking is used in the attention layers when it is desired to block illegal connections (using information not available at inference time)
Sinusoidal values, called positional encodings, are added to each input embedding to allow the model to infer their order (because the values from the previous time step are not used, as there is no recurrence).
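The last two notes, masking and positional encodings, can be sketched in a few lines. This is a simplified illustration with names of my own choosing: the mask follows the paper's description of setting illegal connections to $-\infty$ before the softmax, and the encodings follow its formulas $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask):
    """Set blocked scores to -inf so that softmax gives them zero weight."""
    return [[s if ok else float("-inf") for s, ok in zip(srow, mrow)]
            for srow, mrow in zip(scores, mask)]

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(0, d_model, 2):  # i is the even dimension index (2i in the paper)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Toy usage: add the encodings to (hypothetical) input embeddings.
embeddings = [[0.1] * 6 for _ in range(4)]
pe = positional_encoding(4, 6)
inputs = [[e + p for e, p in zip(erow, prow)] for erow, prow in zip(embeddings, pe)]
```

Because every position gets a different vector added to it, two identical tokens at different positions no longer look the same to the attention layers.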

MY 2¢

Amazing to see such a large number of engineering and theoretical improvements picked up from multiple sources, combined with original developments (like multi-head attention) and packed into a single implementation.

1: Constituency parsing: it refers to breaking a sentence down into nested syntactic constituents (noun phrases, verb phrases, etc.), producing what's called a parse tree.

References

Arxiv: Vaswani et al. 2017: Attention Is All You Need
- Code is available under tensorflow/tensor2tensor which has been replaced by google/trax
Arxiv: Bahdanau et al. 2014: Neural Machine Translation by Jointly Learning to Align and Translate

Felipe 27 Jun 2020 11 Feb 2023 paper-summary sequence-learning attention transformer-architecture