Paper Summary: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


This article introduces BERT, a Transformer-based fine-tuning3 architecture for a wide range of NLP tasks.

BERT pre-trains a Transformer with bidirectional self-attention (instead of left-to-right only) and combines token-level and sentence-level self-supervision so that the model is good at both levels of tasks.


Verify if transfer-learning approaches can also benefit from bidirectional architectures.

Test different self-supervision strategies (token-level and sentence-level) together.


  • Two steps: Pre-training and fine-tuning

  • Self-supervision target. BERT uses two tasks:

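    • A masked language model, AKA the Cloze task, whereby a fraction of the input tokens (15% in the paper) is masked at random and the net must predict them from the surrounding words.
    • A "next sentence prediction" self-supervision target in addition to the above: given two sentences, predict whether the second one actually follows the first. (Binarized, as in a 1 or 0 target)
  • Bidirectional Transformers: BERT uses bidirectional self-attention, so each token attends to context on both sides (GPT-1 uses left-to-right self-attention, where a token only attends to the tokens before it)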

  • Encoding: Input embeddings are a sum of the raw token embeddings (WordPiece), segment embeddings that tell which sentence each token is from, and learned positional embeddings (not the fixed sine/cosine encoding of the original Transformer).
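The masked language model target from the list above can be sketched in a few lines. This is a minimal sketch, assuming the corruption rule described in the paper: about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary and the per-token coin flip (instead of sampling exactly 15% of the positions) are my own simplifications, not the authors' code.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS:
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


# Hypothetical toy input, only for illustration
tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
vocab = ["my", "dog", "is", "hairy", "cat", "apple"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The loss would only be computed at positions where `labels` is not None, which is what makes this a Cloze-style target rather than a full reconstruction of the input.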


  • SOTA scores for many NLP tasks and benchmarks such as GLUE and SQuAD.

  • Better results than GPT-1 with the same number of parameters (BERT-Base was sized to match GPT-1 for this comparison)


  • Feature-based adaptation vs fine-tuning: "There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning"

    • Feature-based: "task-specific architectures that include the pre-trained representations as additional features"1
    • Fine-tuning: "introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters"2
  • Architecture: "A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture."


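  • They mention that the Billion Word Benchmark is a collection of shuffled sentences, which makes it impossible to extract long contiguous sequences and hurts document-level comprehension. For pre-training they use document-level corpora instead (BooksCorpus and English Wikipedia).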

  • During fine-tuning, all pre-trained parameters are updated. No frozen layers.

  • BERT can also be used purely as a feature extractor, producing embeddings for a separate downstream model. This performs slightly worse than the fine-tuning approach but is still very good.

    • Note that it's possible to use several model layers as embeddings, not just the last layer! In the paper's NER experiment, concatenating the last four hidden layers worked best.
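To make the "several layers as embeddings" idea concrete, here is a minimal sketch of concatenating the last few hidden layers into one feature vector per token. The helper `concat_last_layers` and the nested-list layout (`hidden_states[layer][token]` is a list of floats) are assumptions for illustration only; a real model would return tensors, and the toy activations below are made up.

```python
def concat_last_layers(hidden_states, num_layers=4):
    """Concatenate each token's vectors from the last `num_layers` layers.

    hidden_states[layer][token] is that token's hidden vector at that layer.
    """
    if not 1 <= num_layers <= len(hidden_states):
        raise ValueError("num_layers must be between 1 and the number of layers")
    selected = hidden_states[-num_layers:]
    seq_len = len(selected[0])
    features = []
    for t in range(seq_len):
        vector = []
        for layer in selected:
            vector.extend(layer[t])  # append this layer's view of token t
        features.append(vector)
    return features


# Toy stand-in for model activations: 12 layers, 3 tokens, hidden size 2
toy_states = [[[float(layer), float(token)] for token in range(3)]
              for layer in range(12)]
features = concat_last_layers(toy_states)
print(len(features), len(features[0]))  # 3 tokens, 4 * 2 = 8 features each
```

The resulting per-token vectors would then be fed to a separate task-specific model while BERT itself stays fixed, which is the feature-based setup from the quotes.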

MY 2¢

Very important point: in the literature, left-only (as in, unidirectional) Transformers are also called Transformer Decoders (because they can be used to generate text), while bidirectional Transformers are called Transformer Encoders.
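The difference between the two comes down to the attention mask. As a rough sketch (the function names `attention_mask` and `show` are mine, and real implementations apply the mask to attention scores inside the model), a decoder lets position i attend only to positions up to i, while an encoder lets every position attend to every other:

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def show(mask):
    """Render a mask as text: 'x' means allowed, '.' means blocked."""
    return "\n".join(" ".join("x" if allowed else "." for allowed in row)
                     for row in mask)


print("Decoder (left-to-right):")
print(show(attention_mask(4, causal=True)))
# x . . .
# x x . .
# x x x .
# x x x x

print("Encoder (bidirectional):")
print(show(attention_mask(4, causal=False)))
# x x x x
# x x x x
# x x x x
# x x x x
```

The triangular mask is what allows a decoder to generate text one token at a time, since no position ever sees the future. It is also why the masked language model target is needed for the encoder: with a full mask, plain next-word prediction would let each token see itself.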



1: One example of a feature-based strategy is ELMo (Peters et al., 2018: Deep Contextualized Word Representations)

2: Fine-tuning is the strategy used by GPT-1 (Radford et al., 2018)

3: As opposed to feature-based (see quotes)