# Paper Summary: Language Models are Few-Shot Learners

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

## WHAT

The GPT-3 model, an autoregressive language model with 175 billion parameters, is introduced.

The authors show that, with a large enough model and training corpus, you can start solving all kinds of problems by few-shot prompting alone, sometimes even beating SOTA, with no fine-tuning.
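As a minimal sketch of what "few-shot prompting" means here: the task is specified purely in the prompt, with a handful of solved examples, and the model completes the last answer. The task description and `Q:`/`A:` formatting below are illustrative, not the paper's exact templates.

```python
# Build a k-shot prompt at inference time: no weight updates, the
# "learning" happens entirely in the model's context window.
def build_prompt(task_description, examples, query, k):
    """Assemble: task description + k solved examples + the query."""
    lines = [task_description]
    for inp, out in examples[:k]:        # k = 0 gives a zero-shot prompt
        lines.append(f"Q: {inp}\nA: {out}")
    lines.append(f"Q: {query}\nA:")      # model completes the answer here
    return "\n\n".join(lines)

prompt = build_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "bread",
    k=2,
)
print(prompt)
```

Setting `k=1` or `k=0` in the same function gives the one-shot and zero-shot settings the paper compares against.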

## WHY

Because the usual pretraining/fine-tuning paradigm for NLP tasks has some downsides:

• The need for a smaller annotated dataset for each new downstream application remains a cost/time bottleneck.

• Fine-tuning such a large pretrained model on small task-specific datasets doesn't necessarily go well: it can latch onto spurious correlations in the narrow fine-tuning distribution and generalize poorly.

## HOW

Added more data and more compute (read: money), with a few tweaks on top of the GPT-2 architecture (e.g. alternating dense and locally banded sparse attention patterns).

## CLAIMS

• The more parameters a model has, the larger the performance differences between zero-, one-, and few-shot learning.

• In some tasks, few-shot (occasionally even one- or zero-shot) learning with GPT-3 175B surpasses task-specific fine-tuned models, but not in all.1

• Near-100% accuracy in adding/subtracting numbers of up to 3 digits in the few-shot setting, but performance degrades as the number of digits grows.
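A toy harness illustrating how such an arithmetic evaluation can be scored by exact match. The `oracle` stand-in model and the problem format are my own assumptions, not the paper's setup.

```python
import random

def make_addition_problem(digits, rng):
    """Generate one addition question and its reference answer."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a} + {b} =", str(a + b)

def exact_match_accuracy(model, digits, n_problems=100, seed=0):
    """Fraction of problems where the model's completion matches exactly."""
    rng = random.Random(seed)
    problems = [make_addition_problem(digits, rng) for _ in range(n_problems)]
    return sum(model(q).strip() == ans for q, ans in problems) / n_problems

# A "perfect" stand-in model (eval is safe here: we built the input
# ourselves); it scores 1.0, illustrating the metric's ceiling.
oracle = lambda q: str(eval(q.rstrip(" =")))
print(exact_match_accuracy(oracle, digits=3))  # 1.0
```

Swapping `oracle` for an actual language-model completion function would reproduce the kind of digit-length accuracy curve the claim describes.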

## QUOTES

• Model size and ability to learn from context: "Larger models make increasingly efficient use of in-context information"

## NOTES

• They provide a consistent definition of zero-shot, one-shot and few-shot learning, i.e. the number of examples provided at inference time (in the prompt), without any inference-time weight updates.

• Eight models are trained, with increasing numbers of parameters (from 125M to 175B), to test how performance scales with capacity.

• One of the evaluations asked humans to judge whether short news articles were model-generated or human-written!

• GPT-3 is purely autoregressive (left-to-right); it does not include any bidirectional architecture.
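The parameter counts in the scaling note above can be roughly sanity-checked with the common ~12·L·d² rule of thumb for a GPT-style decoder (attention plus MLP weights per layer). The layer/width figures below are from the paper's model table; the rule of thumb ignores embeddings, biases, and layer norms, so it slightly undershoots.

```python
# Approximate transformer decoder parameter count:
# per layer, ~4*d^2 for attention projections + ~8*d^2 for the MLP.
def approx_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

# Smallest and largest models in the GPT-3 family:
print(f"{approx_params(12, 768):.2e}")    # ~8.5e7 -> "125M" model ballpark
print(f"{approx_params(96, 12288):.2e}")  # ~1.7e11 -> "175B" model ballpark
```

The gap between 85M and the advertised 125M is mostly the token-embedding matrix, which the rule of thumb leaves out.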

## MY 2¢

• One interesting point is that they found a bug in their code for removing overlaps between train and test data. But the cost of retraining was prohibitive, so they didn't retrain the whole thing; instead they measured the impact of the contamination after the fact!

• For translation tasks, the direction of translation matters a lot (performance is better when translating into English than when translating from English)

• There are so many different NLP tasks available; you can basically encode any problem as an NLP problem, provided you can represent it in words.

• The data overlap problem is larger than I first thought; it makes one wonder how much contamination skews the reported results.

• Section 5 (Limitations) is a great discussion of the operational challenges of training and running inference on models this large.
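The train/test overlap detection mentioned above can be sketched as n-gram matching. The paper's contamination analysis used 13-grams over a more careful tokenization; this toy version uses whitespace tokens and a smaller n for readability.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_doc, train_ngrams, n=5):
    """Flag a test document that shares any n-gram with the training set."""
    return not ngrams(test_doc, n).isdisjoint(train_ngrams)

train = ngrams("the quick brown fox jumps over the lazy dog", 5)
print(is_contaminated("he saw the quick brown fox jumps away", train))       # True
print(is_contaminated("a completely unrelated sentence here today", train))  # False
```

Their bug (overlaps detected but only partially removed) is exactly the kind of thing a filter like this can hide: the detection step can be correct while the removal step silently drops matches.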

### Footnotes

1: But in all likelihood, training GPT-3 with more than 175B params could change that.