Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
This article introduces the BERT model, which is a type of Transformer-based fine-tuning [3] architecture for all sorts of NLP tasks.
BERT brings bidirectional self-attention to pre-trained Transformer language models (instead of left-to-right only) and combines token-level and sentence-level self-supervision so that the model is good at both levels of tasks.
Verify if transfer-learning approaches can also benefit from bidirectional architectures.
Test different self-supervision strategies (token-level and sentence-level) together.
Two steps: Pre-training and fine-tuning
Self-supervision target. BERT uses two tasks:
- A masked language model, AKA the Cloze task, whereby a random 15% of the input tokens are masked and the net must predict them from the surrounding words on both sides.
- A "next sentence prediction" self-supervision target in addition to the above: given two sentences, predict whether the second one actually follows the first. (Binarized, as in a 1 or 0 target)
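A minimal sketch of how training examples for these two targets could be built. The 15% selection rate and the 80/10/10 split (replace with `[MASK]`, replace with a random token, leave unchanged) come from the paper; the function names, the toy data layout (a corpus as a list of documents, each a list of tokenized sentences) and the plain-Python style are my own assumptions, not the authors' code.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are selected with prob. mask_prob.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

def make_nsp_example(doc, index, corpus, rng):
    """Pair sentence doc[index] with its real successor (label 1) or a random one (label 0).

    Simplification (my assumption): the random sentence is not checked against
    the real successor, so a negative can occasionally be a true pair.
    """
    first = doc[index]
    if rng.random() < 0.5:
        second, label = doc[index + 1], 1
    else:
        second, label = rng.choice(rng.choice(corpus)), 0
    return ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"], label
```

Note that the loss for the masked language model is only computed at positions where `labels` is not `None`, which is why the function returns both lists.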
Bidirectional Transformers: BERT uses bidirectional self-attention, where every token attends to both its left and right context (GPT-style Transformer language models use left-to-right self-attention only)
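The difference between the two kinds of self-attention boils down to which positions each token is allowed to look at. A small sketch of the two attention masks, with a hypothetical helper `attention_mask` that I made up for illustration (padding is ignored here):

```python
def attention_mask(seq_len, bidirectional):
    """mask[i][j] is True when position i may attend to position j."""
    return [[bidirectional or j <= i for j in range(seq_len)]
            for i in range(seq_len)]

# BERT-style: every position sees the whole sequence.
bert_style = attention_mask(4, bidirectional=True)

# GPT-style (left-to-right): position i only sees positions 0..i.
gpt_style = attention_mask(4, bidirectional=False)
```

The GPT-style mask is lower-triangular, which is what makes text generation possible but hides the right-hand context during pre-training.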
Encoding: Input embeddings are actually a sum of the raw token embeddings (WordPiece), segment embeddings to tell which sentence each token is from, and learned positional embeddings (not the fixed sine/cosine encoding of the original Transformer).
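To make the "sum of three embeddings" concrete, here is a toy sketch. The table sizes, the random initialization and the names `make_table` and `embed` are assumptions for illustration only; in the real model the three tables are parameters learned during pre-training.

```python
import random

def make_table(num_rows, dim, rng):
    """A toy embedding table: one small random vector per row."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Input vector at each position = token + segment + position embedding."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = (token_table[tok], segment_table[seg], position_table[position])
        vectors.append([t + s + p for t, s, p in zip(*rows)])
    return vectors

rng = random.Random(0)
dim = 4
token_table = make_table(10, dim, rng)    # vocabulary of 10 toy tokens
segment_table = make_table(2, dim, rng)   # sentence A (0) or sentence B (1)
position_table = make_table(8, dim, rng)  # up to 8 positions

token_ids = [1, 5, 2, 7, 2]
segment_ids = [0, 0, 0, 1, 1]
inputs = embed(token_ids, segment_ids, token_table, segment_table, position_table)
```

Token id 2 appears twice above, but its two input vectors differ because the segment and position embeddings differ.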
SOTA scores for many NLP tasks and benchmarks such as GLUE and SQuAD.
Better results than GPT-1 with the same number of parameters (the smaller BERT model was sized to match GPT-1 for this comparison)
Feature-based adaptation vs fine-tuning: "There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning"
Architecture: "A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture."
They mention that the Billion Word Benchmark is a collection of shuffled sentences and that this hurts document-level comprehension: a document-level corpus is needed to extract long contiguous sequences.
During the fine-tuning step, all pre-trained parameters are updated. No frozen layers.
BERT can also be used just to produce embeddings for a downstream model. This performs slightly worse than the fine-tuning approach but is still very good.
- Note that it's possible to use several model layers as embeddings, not just the last layer!
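For instance, in the paper's feature-based experiment, concatenating the token representations from the top four hidden layers was the best of the variants tried. A toy sketch of that concatenation, assuming the per-layer outputs are already available as nested lists (the helper name and data layout are mine, not from any library):

```python
def concat_top_layers(hidden_states, num_layers=4):
    """hidden_states[layer][token] is a vector; return one long vector per token."""
    top = hidden_states[-num_layers:]
    num_tokens = len(top[0])
    return [[x for layer in top for x in layer[t]] for t in range(num_tokens)]

# Toy data: 12 layers, 3 tokens, hidden size 2; each value is its layer index.
hidden_states = [[[float(layer)] * 2 for _ in range(3)] for layer in range(12)]
features = concat_top_layers(hidden_states)
# Each token now has a feature vector of size 4 * 2 = 8.
```

The resulting vectors are then fed to a separate downstream model while BERT's own parameters stay fixed.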
Very important point: left-only (as in, unidirectional) Transformers are also called Transformer Decoders (because they can be used to generate text) while bidirectional transformers are called Transformer Encoders in the literature.
1: One example of a feature-based strategy is Peters et al., 2018: Deep Contextualized Word Representations
2: Fine-tuning is the strategy used by GPT-1 (Radford et al., 2018)
3: As opposed to feature-based (see the quotes above)