Paper Summary: Language Models are Few-Shot Learners
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
The GPT-3 model is introduced.
The authors show that, with a big enough model trained on enough data, you can start solving all kinds of problems by few-shot prompting alone, sometimes even beating SOTA, with no fine-tuning.
WHY
Because the usual pretraining + fine-tuning paradigm for NLP tasks has some downsides:
Needing a new annotated dataset for every downstream application is still a cost/time bottleneck.
Fine-tuning such a large pretrained model on a small task-specific dataset doesn't necessarily go well: it can latch onto spurious correlations and generalize poorly outside the fine-tuning distribution.
HOW
Take GPT-2, apply some tweaks on top (e.g., alternating dense and sparse attention patterns), and scale it up with much more data and much more compute ($$).
CLAIMS
The more parameters a model has, the larger the performance gap between zero-, one-, and few-shot settings: bigger models benefit more from in-context examples.
On some tasks, few-shot (and occasionally even one- or zero-shot) GPT-3 175B surpasses fine-tuned task-specific SOTA models, but not on all of them.¹
Near-100% accuracy on adding/subtracting numbers of up to 3 digits, but accuracy degrades as more digits are added (few-shot setting; see the prompt sketch below).
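To make the few-shot arithmetic setting concrete, here is a sketch of how such prompts can be generated. The "Q: What is X plus Y? A:" template follows the paper's description of the task format; the sampling code itself is my own illustration.

```python
import random

def addition_prompt(k_shots=3, digits=3):
    """Build a K-shot prompt for D-digit addition, plus the expected answer."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    lines = []
    for _ in range(k_shots):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        lines.append(f"Q: What is {a} plus {b}? A: {a + b}")
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    lines.append(f"Q: What is {a} plus {b}? A:")
    return "\n".join(lines), a + b

prompt, expected = addition_prompt()
print(prompt)    # the model's completion would be compared against...
print(expected)  # ...this expected sum
```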
QUOTES
- Model size and ability to learn from context: "Larger models make increasingly efficient use of in-context information"
NOTES
They provide a consistent definition of zero-shot, one-shot, and few-shot learning: it is simply the number of examples provided at inference time (in the prompt), with no weight updates at inference time.
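As a minimal sketch of what this looks like in practice: the "source => target" formatting and the translation examples below echo Figure 2.1 of the paper, but build_prompt() itself is my own illustration, not the paper's evaluation harness.

```python
def build_prompt(task_description, examples, query):
    """Concatenate a task description, K solved examples, and the query.

    K = 0 is zero-shot, K = 1 is one-shot, K > 1 is few-shot.
    The examples live only in the prompt; no weights are updated.
    """
    parts = [task_description]
    for source, target in examples:
        parts.append(f"{source} => {target}")
    parts.append(f"{query} =>")
    return "\n".join(parts)

# Few-shot (K = 2) English-to-French prompt:
print(build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "plush giraffe",
))
```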
Several models with increasing numbers of parameters (from 125M to 175B) are trained to test how performance scales with capacity.
One of the evaluations even asked humans to detect whether a given text was model-generated or not!
GPT-3 does not include any bidirectional architecture: it is a purely autoregressive (left-to-right) model.
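To make that concrete, autoregressive attention means position i may attend only to positions j <= i, so information never flows from right to left; a bidirectional model like BERT allows attention in both directions. A toy NumPy sketch of such a causal mask (illustrative, not GPT-3's actual implementation):

```python
import numpy as np

# Causal (autoregressive) attention mask: True where attention is allowed.
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```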
MY 2¢
One interesting point: they found a bug in their code for removing overlaps between train and test data, but the cost of retraining was prohibitive, so they didn't retrain the whole thing because of that!
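For reference, the paper's contamination check is based on n-gram overlap (they describe 13-gram overlaps). A much simplified sketch of the idea, with my own naive whitespace tokenization and a tiny n just so the overlap is visible:

```python
def ngrams(text, n):
    """Set of word-level n-grams, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(test_doc, train_ngrams, n):
    """True if any n-gram of the test document also occurs in training data."""
    return not ngrams(test_doc, n).isdisjoint(train_ngrams)

# Toy usage (the paper uses n = 13; n = 3 here just to show a hit):
train_ngrams = ngrams("the quick brown fox jumps over the lazy dog", n=3)
print(contaminated("a quick brown fox appeared", train_ngrams, n=3))  # True
```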
For translation tasks, the direction matters a lot: performance is noticeably better when translating into English than when translating from English.
There are so many different NLP tasks available; you can basically encode any problem as an NLP problem, provided you can represent it in words.
The data overlap problem is larger than I first thought; it makes one wonder how much contamination skews the results.
Section 5 (Limitations) is a great discussion of the operational challenges of training such models and running inference with them.
Footnotes
1: But in all likelihood, training GPT-3 with more than 175B params could change that.