Paper Summary: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Last updated: 23 Jul 2025

Please note This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

This article introduces the BERT model, which is a type of transformer-based fine-tuning³ architecture for all sorts of NLP tasks.

BERT introduces bidirectional self-attention to Transformers (instead of left-to-right only) and combine both token-level and sentence-level self-supervision so that the model is good both levels of tasks.

WHY

Verify if transfer-learning approaches can also benefit from bidirectional architectures.

Test different self-supervision strategies (token-level and sentence-level) together.

HOW

Two steps: Pre-training and fine-tuning
Self-supervision target. BERT uses two tasks:
- A masked language model, AKA the Cloze task whereby one word at random is masked an the net must predict it from surrounding words.
- "Next sentence prediction" self-supervision target in addition to the above. (Binarized, as in a 1 or 0 target)
Bidirectional Transformers: BERT uses bidirectional self-attention (vanilla Transformers use left-only self-attention)
Encoding: Input embeddings are actually a sum of the raw token embeddings (WordPiece), segment embeddings to tell which sentence it's from and a sine/cosine positional embedding.

CLAIMS

SOTA in multiple NLP tasks: SOTA scores for many NLP tasks and benchmarks such as GLUE and SQuAD.
Better than GPT-1 at the same capacity: Better results than GPT-1 with the same number of parameters

QUOTES

Feature-based adaptation vs fine-tuning: "There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning"
- Feature-based: "task-specific architectures that include the pre-trained representations as additional features"¹
- Fine-tuning: "introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters"²
Architecture: "A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture."

NOTES

BERT is an example of a encoder-only model. It cannot be used for generating text, as text-generation is done by the decoder .
That fact that the Billion Word Benchmark is a collection of shuffled sentences hurts document-level comprehension.
During the fine-tuning task, all pre-trained parameters are updated. No frozen layers.
BERT can be used to just produce embeddings to be used downstream. It performs slightly worse than in the fine-tuning approach but is still very good.
- It's possible to use several model layers as embeddings, not just the last layer!

MY 2¢

Very important point: left-only (as in, unidirectional) Transformers are also called Transformer Decoders (because they can be used to generate text) while bidirectional transformers are called Transformer Encoders in the literature.

References

Devlin et al 2019 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Footnotes

1: One example of a feature-based strategy is Peters et al, 2018: Deep Contextualized Word Representations

2: Fine-tuning is the strategy used by GPT-1 (Radford et al, 2018)

3: As opposed to feature-based (see quotes)

Felipe 01 Jan 2023 23 Jul 2025 paper-summary language-models