Paper Summary: Multitask Prompted Training Enables Zero-Shot Task Generalization


Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Paper: Multitask Prompted Training Enables Zero-Shot Task Generalization (Source)

WHAT

Investigate whether and how fine-tuning a pretrained LM on prompted, text-to-text NLP tasks (as in T5; summary) helps it perform better on unseen tasks.
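
To make the setup concrete, here is a minimal sketch (mine, not the paper's code) of how a supervised example is cast as a prompted text-to-text pair. The template wording and label mapping are illustrative assumptions, not the paper's actual prompt templates.

```python
# Hedged sketch: casting a supervised sentiment example as a prompted
# text-to-text pair, in the style of the paper's multitask training mix.
# Template wording and label mapping here are illustrative only.
example = {
    "review": "A slow start, but the last act is genuinely moving.",
    "label": 1,  # 1 = positive in this toy setup
}

template = (
    "Review: {review}\n"
    "Is this review positive or negative?"
)
label_words = {0: "negative", 1: "positive"}

input_text = template.format(**example)
target_text = label_words[example["label"]]
# During fine-tuning, the model learns to generate target_text given input_text.
```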

WHY

  • To see whether the gains from T5-style multitask training generalize to unseen tasks;

  • To compare those gains (if any) with the performance of much larger vanilla LMs such as GPT-3.

HOW

1) Start from an LM pretrained with masked language modeling (span corruption) on the C4 dataset, i.e. T5, further adapted with a standard language modeling objective (T5+LM);

2) Fine-tune (SFT) that model on input-output pairs from a large mixture of NLP datasets, with each example cast as a natural-language prompt (multiple prompt templates per dataset);

3) Test how the model from step 2 performs when prompted, zero-shot, to solve held-out NLP tasks that were not in the SFT training mixture (see the sketch below).
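
Step 3 can be reproduced with the released checkpoints. Below is a minimal zero-shot inference sketch using Hugging Face transformers; "bigscience/T0_3B" is the public 3B-parameter checkpoint, and the NLI-style prompt is my own wording (NLI is one of the paper's held-out task families), not one of the paper's templates.

```python
# Hedged sketch: zero-shot prompting of a released T0 checkpoint with
# Hugging Face transformers. The prompt wording is illustrative; the paper
# evaluates held-out tasks with its own prompt templates.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/T0_3B"  # public 3B checkpoint; larger ones exist
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# An NLI-style query, phrased in plain language, from a task family the
# model did not see during multitask fine-tuning.
prompt = (
    "Premise: The cat sat on the mat.\n"
    "Hypothesis: An animal is on the mat.\n"
    "Does the premise entail the hypothesis? Yes or no?"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```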

CLAIMS

  • Fine-tuning LLMs on a mixture of prompted NLP tasks makes them perform better on other, unseen NLP tasks: the resulting models outperform much larger vanilla LMs such as GPT-3 on many held-out datasets, despite having far fewer parameters.

QUOTES

  • Masked Language Modeling FTW: "We note that masked language modeling has repeatedly been shown to be a dramatically more effective pre-training strategy."

EXTENDS/USES

  • Architecture and modeling decisions from T5

NOTES

  • The authors cite FLAN (summary), so FLAN came before T0.

MY 2¢

  • It seems to me this was a long paper with relatively few important points; it reads like little more than an addendum to T5.

  • Very similar to FLAN in scope and findings, except that FLAN claims fine-tuning hurts model performance on unseen tasks if the model capacity is too low.

How is T0 different from T5?

  • T5 did not investigate zero-shot performance on unseen tasks; T0 did.

  • T5 trained on more tasks than T0, but T0 used a wider variety of datasets.

