Paper Summary: Language Models are Few-Shot Learners

Last updated: 05 Feb 2023

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

The GPT-3 model is introduced.

The authors show that, with a large enough model (and enough data), you can start solving all kinds of problems by few-shot prompting, sometimes even beating SOTA, with no fine-tuning.

WHY

Because the usual pretraining/fine-tuning paradigm for NLP tasks has some downsides:

The need for a (smaller) annotated dataset for each new downstream application is still a cost/time bottleneck.
Fine-tuning such a large pretrained model on a small task-specific dataset doesn't necessarily go well: it can pick up spurious correlations and generalize poorly outside that dataset.

HOW

Scaled up GPT-2: more parameters, more data (and more money $$), with some tweaks on top.
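To get a feel for the scale involved, here is a back-of-the-envelope parameter count for a GPT-style decoder. It is a rough sketch, not the authors' accounting: the `12 * n_layers * d_model**2` rule of thumb ignores biases and layer norms, the function name is my own, and the vocabulary size (50257, GPT-2's BPE vocabulary) and the layer/width settings are the values I recall from the paper's model table, so treat them as assumptions.

```python
def approx_params(n_layers: int, d_model: int,
                  vocab_size: int = 50257, n_ctx: int = 2048) -> int:
    """Rough parameter count for a decoder-only transformer.

    Biases and layer-norm parameters are ignored (they are negligible at this scale).
    """
    # Per block: ~4*d^2 for attention (Q, K, V, output projection)
    # plus ~8*d^2 for the MLP with a 4x hidden width.
    blocks = 12 * n_layers * d_model ** 2
    # Token embeddings plus learned position embeddings.
    embeddings = (vocab_size + n_ctx) * d_model
    return blocks + embeddings


# Assumed settings for the smallest and largest models in the paper.
small = approx_params(n_layers=12, d_model=768)
large = approx_params(n_layers=96, d_model=12288)
print(f"{small / 1e6:.0f}M, {large / 1e9:.0f}B")  # prints "125M, 175B"
```

The estimate lands very close to the advertised 125M and 175B, which suggests the bulk of the parameters sits in the attention and MLP weight matrices.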

CLAIMS

The more parameters a model has, the larger the performance differences between zero-, one-, and few-shot learning.
In some tasks (but not all), few-shot (even one- or zero-shot) learning with GPT-3 175B surpasses task-specific fine-tuned models.¹
Near 100% accuracy when adding/subtracting 2-digit numbers and still strong with 3 digits, but it gets progressively worse as we add more digits (few-shot setting).
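The arithmetic claim can be checked with a small harness along these lines. This is a minimal sketch under my own assumptions: the question wording mimics the style used in the paper, but `make_problem`, `build_prompt`, `accuracy` and the `complete` callable are hypothetical names, and `oracle` is a stand-in that simply computes the sum so the script runs without access to a real model.

```python
import random
from typing import Callable, List, Tuple


def make_problem(n_digits: int, rng: random.Random) -> Tuple[str, str]:
    """Sample an addition problem with operands drawn from [0, 10**n_digits)."""
    a = rng.randrange(10 ** n_digits)
    b = rng.randrange(10 ** n_digits)
    return f"Q: What is {a} plus {b}?", str(a + b)


def build_prompt(shots: List[Tuple[str, str]], question: str) -> str:
    """Solved examples first, then the unanswered question."""
    lines = [f"{q} A: {ans}" for q, ans in shots]
    lines.append(f"{question} A:")
    return "\n".join(lines)


def accuracy(complete: Callable[[str], str], n_digits: int,
             k: int = 5, n_eval: int = 100, seed: int = 0) -> float:
    """Exact-match accuracy of `complete` on k-shot addition prompts."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_eval):
        shots = [make_problem(n_digits, rng) for _ in range(k)]
        question, answer = make_problem(n_digits, rng)
        prediction = complete(build_prompt(shots, question)).strip()
        correct += prediction == answer
    return correct / n_eval


def oracle(prompt: str) -> str:
    """Stand-in for a language model: parses the last question and adds."""
    words = prompt.splitlines()[-1].split()
    return f" {int(words[3]) + int(words[5].rstrip('?'))}"


for digits in (2, 3, 5):
    print(digits, accuracy(oracle, digits))
```

Swapping `oracle` for a function that calls an actual model would show how accuracy changes as `n_digits` grows.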

QUOTES

Model size and ability to learn from context: "Larger models make increasingly efficient use of in-context information"

NOTES

They provide a consistent definition of zero-shot, one-shot and few-shot learning, i.e. the number of examples provided at inference time (in the prompt), without any inference-time weight updates.
Several models are trained with an increasing number of parameters (from 125M to 175B) to test how performance scales with capacity.
One of the evaluation tasks asked humans to detect whether a piece of text was model-generated or not!
GPT-3 is purely autoregressive (left-to-right); it does not include any bidirectional architecture.
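The zero-/one-/few-shot definitions above boil down to how many solved examples go into the prompt. Here is a minimal sketch, assuming a simple `source => target` layout similar to the translation illustration in the paper; the function name `k_shot_prompt` and the example pairs are mine, and no model weights are touched anywhere.

```python
from typing import List, Tuple


def k_shot_prompt(task_description: str,
                  examples: List[Tuple[str, str]],
                  query: str,
                  k: int) -> str:
    """Build a prompt with a task description, k solved examples and the query.

    k = 0 is zero-shot, k = 1 is one-shot, k > 1 is few-shot.
    """
    lines = [task_description]
    lines += [f"{source} => {target}" for source, target in examples[:k]]
    lines.append(f"{query} =>")
    return "\n".join(lines)


examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]

for k in (0, 1, 3):
    print(f"--- k = {k} ---")
    print(k_shot_prompt("Translate English to French:", examples, "cheese", k))
```

The only thing that changes between the three settings is the text of the prompt; the model itself is the same in all of them.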

MY 2¢

One interesting point is that they found a bug in the code that removes overlaps between train and test data, but retraining was prohibitively expensive, so they didn't retrain the whole thing!
For translation tasks, the direction of translation matters a lot (performance is better when translating into English than when translating from English)
There are so many different NLP tasks available; you can basically encode any problem as an NLP problem, provided you can represent it in words.
The data overlap problem is larger than I first thought; it makes one wonder how much it skews the results.
Section 5 (Limitations) is a great write-up of the operational challenges of training such models and running inference on them.
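On the train/test overlap issue: a common way to flag contaminated test examples is to look for shared word n-grams. The sketch below is my own simplified illustration, not the authors' pipeline (their procedure is more involved); the function names, the tokenization via a plain regex, and the toy value `n=3` are all assumptions made to keep the example short.

```python
import re
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Set of lowercase word n-grams in `text` (empty if the text is shorter than n)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs: Iterable[str], n: int) -> Set[NGram]:
    """Union of all n-grams seen in the training documents."""
    index: Set[NGram] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def is_contaminated(test_example: str, train_index: Set[NGram], n: int) -> bool:
    """True if the test example shares at least one n-gram with the training data."""
    return not ngrams(test_example, n).isdisjoint(train_index)


train_docs = ["The quick brown fox jumps over the lazy dog."]
index = build_train_index(train_docs, n=3)

print(is_contaminated("A quick brown fox appeared.", index, n=3))    # True
print(is_contaminated("Something entirely different.", index, n=3))  # False
```

In practice a much longer n-gram would be used so that common phrases are not flagged, and the training index would need a more memory-efficient structure than a Python set.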

References

Brown et al. 2020: Language Models are Few-Shot Learners

Footnotes

1: But in all likelihood, training a model with more than 175B params could change that.

Felipe 01 Jan 2023 05 Feb 2023 paper-summary language-models