Paper Summary: Language Models are Unsupervised Multitask Learners



Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Short review of the 2019 article "Language Models are Unsupervised Multitask Learners" by Radford et al.


This is the paper that introduces the GPT-2 Transformer model.


The authors argue that training models for specific NLP tasks (translation, inference, classification, QA, etc.) and on specific domains (news, encyclopedia text, etc.) hinders generalization, so they propose a new approach.

The new approach combines two existing approaches:

  • Transfer learning (Pre-training + supervised fine-tuning)

  • Zero-shot learning using language models only


  • The model is a Transformer like the original GPT, with a few optimizations (see note 1), much more data and a much higher model capacity (1.5 billion parameters)
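As a rough sanity check on the 1.5 billion figure, the sketch below estimates the parameter count of a GPT-style decoder. The hyperparameters of the largest model (48 layers, model width 1600, vocabulary of 50,257 tokens, context of 1024 tokens) are the ones reported in the paper. The function name and the formula are my own simplification: it assumes a 4x MLP expansion and tied input/output embeddings, and it ignores biases and layer-norm parameters.

```python
def approx_transformer_params(n_layers, d_model, vocab_size, n_ctx):
    """Rough weight count for a GPT-style decoder (hypothetical helper).

    Assumption: biases and layer-norm parameters are ignored, and the output
    layer shares its weights with the token embedding.
    """
    # Per block: 4*d^2 for attention (Q, K, V, output projection)
    # plus 8*d^2 for the MLP (d -> 4d -> d).
    per_block = 12 * d_model ** 2
    # Token embeddings plus learned position embeddings.
    embeddings = (vocab_size + n_ctx) * d_model
    return n_layers * per_block + embeddings


largest = approx_transformer_params(n_layers=48, d_model=1600,
                                    vocab_size=50257, n_ctx=1024)
print(f"{largest / 1e9:.2f}B parameters")  # prints "1.56B parameters"
```

The estimate lands close to the reported size, which suggests that almost all of the capacity sits in the Transformer blocks rather than in the embeddings.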


  • Achieves good results on several NLP tasks/datasets in a zero-shot setting:

    • Short-range/long-range inference (surpasses SOTA)
    • Reading comprehension (matches or exceeds 3 of 4 baseline systems on CoQA, without using its training examples)
    • Common-sense reasoning
    • Question-answering
    • Summarization
    • Translation


  • "The current best performing systems on language tasks utilize a combination of pre-training and supervised fine-tuning."

  • Word-level language models are much quicker than character-level ones, but we obviously cannot train a word-level language model with a vocabulary containing every possible word.

  • "common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images"

  • "a model trained to generate Wikipedia articles also learned to translate names between languages". Very interesting; the finding is from Liu et al. (2018).


  • OpenAI initially withheld the full trained model, citing concerns that it could be misused to impersonate others and to generate fake text that is too convincing

  • Related to the NLP Decathlon article

  • The authors used neither word-level nor character-level language modelling, but an intermediate approach, Byte Pair Encoding (BPE), applied at the byte level

  • Very clever: to get the model to produce a summary, they appended "TL;DR:" to the article as a prompt
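To make the BPE note above more concrete, here is a minimal sketch of how BPE merges are learned on a toy corpus. It follows the classic character-level procedure (repeatedly merge the most frequent adjacent pair of symbols); GPT-2 applies the same idea to bytes and adds extra merge restrictions, which this sketch does not reproduce. The function names and the toy corpus are made up for illustration.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Start with every word split into single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words


merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab)
```

After two merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still split into "low" plus single characters. That is the middle ground between word-level and character-level modelling: common words get their own token, rare words fall back to smaller pieces.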

MY 2¢

  • Interesting to realize that, while a language model is just a probability

$$ p(\text{output word} \ | \ \text{context}) $$

all other NLP tasks can also be represented in a similar probabilistic way (with just one extra conditioning variable)

$$ p(\text{output word} \ | \ \text{context},\text{task}) $$
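In the paper the extra conditioning on the task is not a separate input: the task is expressed in the context text itself, so a plain language model can be used unchanged. The sketch below builds such contexts for two tasks. The "TL;DR:" suffix and the "english sentence = french sentence" pattern come from the paper; the function names, the example sentences and the exact use of newlines are my own assumptions.

```python
def summarization_prompt(article):
    """Context for summarization: the article followed by "TL;DR:"."""
    # Assumption: a single newline separates the article from the cue.
    return f"{article}\nTL;DR:"


def translation_prompt(examples, sentence):
    """Context for English-to-French translation.

    A few "english sentence = french sentence" pairs, then the sentence to
    translate followed by "=", left open for the model to complete.
    """
    lines = [f"{en} = {fr}" for en, fr in examples]
    lines.append(f"{sentence} =")
    return "\n".join(lines)


print(summarization_prompt("Some long news article."))
print(translation_prompt([("good morning", "bonjour")], "thank you"))
```

Either string is then fed to the language model as ordinary context, and the continuation it generates is read as the answer to the task.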

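1: Layer normalization is moved to the input of each sub-block, with an additional layer normalization after the final self-attention block; the vocabulary is expanded to around 50k tokens (50,257) and the context size to 1024 tokens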