Paper Summary: Language Models are Few-Shot Learners

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


The GPT-3 model is introduced.

The authors show that, with enough data and model capacity, few-shot prompting alone can solve all kinds of problems, sometimes even beating SOTA, with no fine-tuning.


Because the usual pretraining/fine-tuning paradigm for NLP tasks has some downsides:

  • The need for a smaller annotated dataset for each new downstream application is still a cost/time bottleneck.

  • Forcing such a large pretrained model to relearn on small, narrow task-specific datasets doesn't necessarily go well (it can latch onto spurious correlations and generalize poorly out of distribution).


Added more data (and more money $$) with some tweaks on top of GPT-2


  • The more parameters a model has, the larger the performance differences between zero-, one-, and few-shot learning.

  • In some tasks, few-shot (even one- or zero-shot) learning with GPT-3 175B surpasses task-specific fine-tuned models, but not in all.1

  • Near 100% accuracy on adding/subtracting numbers of up to 3 digits, but performance degrades as more digits are added (few-shot setting).
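As a sketch of what this arithmetic setting looks like, here is how such a few-shot prompt could be assembled (my own illustration; the function name and exact phrasing are assumptions, though the paper evaluates arithmetic with a similar Q/A-style prompt):

```python
# Hypothetical helper: build a few-shot prompt for 3-digit addition.
# The (a, b) example pairs are shown with their answers; the query is
# left open for the model to complete.
def make_addition_prompt(examples, query):
    """Build a few-shot addition prompt from solved examples plus one query."""
    lines = [f"Q: What is {a} plus {b}? A: {a + b}" for a, b in examples]
    lines.append(f"Q: What is {query[0]} plus {query[1]}? A:")
    return "\n".join(lines)

prompt = make_addition_prompt([(123, 456), (700, 250)], (312, 488))
print(prompt)
```

The model's only "training signal" here is the pattern in the prompt itself; whether it completes the last line with 800 is exactly what the arithmetic benchmark measures.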


  • Model size and ability to learn from context: "Larger models make increasingly efficient use of in-context information"


  • They provide a consistent definition of zero-shot, one-shot and few-shot learning, i.e. defined by the number of examples provided in the prompt at inference time, with no weight updates.

  • Several models are trained with increasing numbers of parameters (from 125M to 175B) to test how performance scales with capacity.

  • One of the several evaluations asked humans to detect whether a piece of text was model-generated or not!

  • GPT-3 does not include any bidirectional architecture (it is a purely autoregressive, left-to-right Transformer).
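The zero-/one-/few-shot regimes above can be made concrete with a small prompt builder (a sketch; `build_prompt` and the `=>` separator are my assumptions, while the English-to-French demos echo the paper's running example):

```python
# Hypothetical prompt builder: k controls the regime.
# k=0 -> zero-shot (task description only), k=1 -> one-shot,
# k>1 -> few-shot. No weights are ever updated; the demos live
# only in the prompt text.
def build_prompt(task_description, demos, query, k):
    parts = [task_description]
    parts += [f"{x} => {y}" for x, y in demos[:k]]
    parts.append(f"{query} =>")
    return "\n".join(parts)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
zero_shot = build_prompt("Translate English to French:", demos, "hello", k=0)
few_shot = build_prompt("Translate English to French:", demos, "hello", k=2)
print(few_shot)
```

The point of the paper's definition is that the only thing changing between regimes is k, the number of in-context examples, which makes the comparisons across model sizes clean.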

MY 2¢

  • One interesting point: they found a bug in their code for removing overlaps between train and test data. But the cost of retraining was prohibitive, so they didn't retrain the whole thing because of that!

  • For translation tasks, the direction of translation matters a lot (performance is better when translating into English than when translating from English)

  • There are so many different NLP tasks available; you can basically encode any problem as an NLP problem, provided you can represent it in words.

  • The data overlap problem is larger than I first thought, which makes one wonder how much it skews the results.

  • Section 5 (Limitations) is a great discussion of the operational challenges of training such models and running inference on them.
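The overlap check the paper describes is based on n-gram matches between evaluation documents and the training set. A minimal sketch of that idea, assuming a 13-gram window similar to the one the paper uses (function names and tokenization are mine, and simplified):

```python
# Sketch of n-gram-based contamination checking between train and test text.
def ngrams(text, n=13):
    """Return the set of whitespace-token n-grams in a document."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(test_doc, train_ngrams, n=13):
    """Flag a test document if any of its n-grams also occurs in training data."""
    return not ngrams(test_doc, n).isdisjoint(train_ngrams)

# Toy usage: a 13-word training sentence yields exactly one 13-gram.
train_doc = "one two three four five six seven eight nine ten eleven twelve thirteen"
train_set = ngrams(train_doc)
```

Even this toy version shows why the bug mattered: a filter like this runs once over the whole corpus before training, so a mistake in it can only be corrected by retraining.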



1: But in all likelihood, training a GPT-3-style model with more than 175B params could change that.