Paper Summary: Multitask Prompted Training Enables Zero-Shot Task Generalization
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
Investigate whether and how fine-tuning a vanilla LM on NLP tasks cast as text-to-text prompts (like T5, summary) helps it perform better on unseen tasks.
WHY
To see whether the results from T5 generalize to unseen tasks;
To compare those gains (if any) with the performance of much larger vanilla LMs such as GPT-3;
HOW
1) Pretrain a vanilla LM using masked language modeling on the C4 dataset (as in T5);
2) Fine-tune (SFT) that model on a multitask mixture of NLP datasets, with each example rendered as a natural-language prompted input-output pair (see the sketch after this list);
3) Test how the model from step 2 performs when prompted, zero-shot, to solve NLP tasks that were held out of the SFT mixture.
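
A minimal sketch of what step 2 looks like in practice, assuming a hypothetical hand-written template and a toy NLI example (the paper actually uses the crowdsourced P3 prompt collection, with several templates per dataset):

```python
# Sketch: rendering one structured NLI example as a prompted text-to-text pair.
# The template below is hypothetical; T0 trains on the crowdsourced P3 templates.

def apply_template(example: dict) -> tuple[str, str]:
    """Turn a structured example into a natural-language (input, target) pair."""
    prompt = (
        f'Suppose "{example["premise"]}" Can we infer that '
        f'"{example["hypothesis"]}"? Yes or no?'
    )
    target = "yes" if example["label"] == 0 else "no"
    return prompt, target

example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A man is performing music.",
    "label": 0,  # 0 = entailment in this toy encoding
}

model_input, model_target = apply_template(example)
print(model_input)   # fed to the encoder during SFT
print(model_target)  # text the decoder is trained to generate
```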
CLAIMS
- Fine-tuning LMs on a set of prompted NLP tasks makes them better at other, unseen NLP tasks than vanilla LMs such as GPT-3, even when the fine-tuned model is much smaller.
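
For a concrete feel of the zero-shot setting behind this claim, here is a sketch of querying one of the released T0 checkpoints with Hugging Face Transformers (assuming the bigscience/T0_3B model id and enough memory to load an ~3B-parameter model):

```python
# Sketch: zero-shot inference with a released T0 checkpoint, no task-specific fine-tuning.
# Assumes the bigscience/T0_3B checkpoint on the Hugging Face Hub; larger T0 variants work the same way.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")

prompt = (
    "Is this review positive or negative? "
    "Review: this is the best cast iron skillet you will ever buy."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "Positive"
```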
QUOTES
- Masked Language Modeling FTW: "We note that masked language modeling has repeatedly been shown to be a dramatically more effective pre-training strategy."
EXTENDS/USES
- Architecture and model decisions from T5
NOTES
- The authors cite FLAN (summary), so FLAN came before T0.
MY 2¢
Seems to me it was a long article with relatively few important points; it reads mostly like an addendum to T5.
Very similar to FLAN in scope and findings, except that FLAN claims fine-tuning hurts performance on unseen tasks if the model capacity is too low.
How is T0 different from T5?
T5 did not investigate zero-shot performance on unseen tasks; T0 did.
T5 was trained on more tasks than T0, but T0 drew on a wider variety of datasets.