Paper Summary: Multitask Prompted Training Enables Zero-Shot Task Generalization


Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.



Investigate whether and how fine-tuning a vanilla LM on text-to-text NLP tasks (like T5, summary) helps it perform better on unseen tasks.


  • To see how the results from T5 generalize to unseen tasks;

  • To compare those gains (if any) with the performance of larger vanilla LMs such as GPT-3.


1) Pretrain a vanilla LM using masked language modeling on the C4 dataset;

2) Fine-tune (SFT) that model on input-output pairs of NLP tasks described in natural language;

3) Test how the model from step 2 performs when prompted to solve NLP tasks not in the SFT training set, in a zero-shot manner.
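Step 2 boils down to mapping each supervised example onto a natural-language (input, target) pair via a prompt template. A minimal sketch in Python; the template wordings below are illustrative, not the paper's exact prompts:

```python
# Turn raw supervised examples into prompted text-to-text pairs,
# roughly in the spirit of T0's prompt templates.

def apply_template(task: str, example: dict) -> tuple[str, str]:
    """Map a raw example to a natural-language (input, target) pair."""
    if task == "nli":
        prompt = (f"{example['premise']} Based on the previous passage, "
                  f"is it true that \"{example['hypothesis']}\"? Yes or no?")
        target = "Yes" if example["label"] == 0 else "No"
    elif task == "summarization":
        prompt = f"Summarize the following article:\n{example['article']}"
        target = example["summary"]
    else:
        raise ValueError(f"No template for task: {task}")
    return prompt, target

nli_example = {"premise": "A dog runs in the park.",
               "hypothesis": "An animal is outside.",
               "label": 0}
prompt, target = apply_template("nli", nli_example)
# The model is then fine-tuned to generate `target` given `prompt`.
```

T0 uses many different templates per dataset, so the same example can appear under several wordings during fine-tuning.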


  • Fine-tuning LLMs on some NLP tasks makes them better at other, unseen NLP tasks (compared with vanilla LMs such as GPT-3), even when the fine-tuned models are much smaller.
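For these zero-shot comparisons, classification tasks are typically scored via rank classification: the model scores every answer choice and the highest-scoring one is the prediction. A toy sketch, where `toy_score` is just a stand-in for a real model's log-likelihood of a choice given the prompt:

```python
# Rank classification: pick the answer choice the model scores highest.
def rank_classify(prompt: str, choices: list[str], score) -> str:
    return max(choices, key=lambda c: score(prompt, c))

# Toy stand-in scorer: word overlap with the prompt. A real implementation
# would sum the fine-tuned model's token log-probabilities for each choice.
def toy_score(prompt: str, choice: str) -> float:
    return len(set(prompt.lower().split()) & set(choice.lower().split()))

pred = rank_classify("the movie was great", ["great", "terrible"], toy_score)
# pred -> "great"
```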


  • Masked Language Modeling FTW: "We note that masked language modeling has repeatedly been shown to be a dramatically more effective pre-training strategy."
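The masked-LM objective here is T5-style span corruption: spans of the input are replaced with sentinel tokens and the model learns to reconstruct them. A simplified sketch with hard-coded spans (real T5 masks random spans covering roughly 15% of tokens):

```python
# Simplified T5-style span corruption.
def span_corrupt(tokens: list[str], spans: list[tuple[int, int]]):
    """Replace each (start, end) span with a sentinel; collect the targets."""
    inp, tgt = [], []
    last = 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[last:s])   # keep unmasked tokens
        inp.append(sentinel)         # mask the span in the input
        tgt.append(sentinel)
        tgt.extend(tokens[s:e])      # masked tokens become the target
        last = e
    inp.extend(tokens[last:])
    return inp, tgt

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (6, 7)])
# inp -> ['the', '<extra_id_0>', 'fox', 'jumps', 'over', '<extra_id_1>', 'lazy', 'dog']
# tgt -> ['<extra_id_0>', 'quick', 'brown', '<extra_id_1>', 'the']
```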


  • Architecture and modeling decisions follow T5.


  • The authors cite FLAN (summary), so FLAN came before T0.

MY 2¢

  • Seems to me it is a long paper with relatively few important points; it reads like an addendum to T5.

  • Very similar to FLAN in scope and findings, except that FLAN claims fine-tuning hurts model performance on unseen tasks if the model capacity is too low.

How is T0 different from T5?

  • T5 did not investigate zero-shot performance on unseen tasks; T0 did.

  • T5 trained on more tasks than T0, but T0 used a wider variety of datasets.