Paper Summary: Few-shot Fine-Tuning vs In-context Learning: a Fair Comparison and Evaluation

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


WHAT

Compare how LLMs generalize to out-of-domain problems when adapted via fine-tuning (FT) versus in-context learning (ICL).

WHY

  • Previous comparisons between FT and ICL did not take model sizes into account.

HOW

  • The authors run several comparisons, checking how the results vary as model capacity grows.

  • Models are compared on natural language inference (NLI) and paraphrase identification tasks.

  • OPT (decoder-only) model variants are used, ranging from 125M to 30B parameters; a minimal ICL sketch follows this list.
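
Below is a minimal sketch of what ICL evaluation with an OPT checkpoint could look like; the pattern, verbalizer, and examples are my own illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Few-shot demonstrations, each rendered with the same textual pattern
# (illustrative NLI examples, not from the paper).
demos = [
    ("A man is playing a guitar.", "A person is making music.", "Yes"),
    ("A dog sleeps on the couch.", "The dog is running outside.", "No"),
]
pattern = "{p} Question: {h} Yes or No? Answer: {lab}\n"
prompt = "".join(pattern.format(p=p, h=h, lab=lab) for p, h, lab in demos)

# The test instance uses the same pattern, with the label left blank.
prompt += "Two kids play soccer. Question: Children are playing a game. Yes or No? Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the final position

# Compare the scores of the two verbalizer tokens.
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
print("entailment" if logits[yes_id] > logits[no_id] else "non-entailment")
```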

CLAIMS

  • Both ICL and FT achieve comparable results when controlled for model size;

  • Both ICL and FT improve as model size grows, but at larger model sizes FT outperforms ICL;

  • The performance of both FT and ICL is unstable, with high variance;

  • FT is not limited by the model's context size, so it can in theory handle much more data;

  • Previous studies reported misleading results because they did not compare the two strategies fairly.

QUOTES

  • On model size: "ICL requires large models to work in contrast to FT, which works well even with small models"

  • On inference time performance: "...the inference time of fine-tuned models is much smaller than ICL, since it only includes the time that it takes to process the minimal pattern and the test instance. When using ICL, each test instance has to include all of the demonstrations as well, which increases the inference time."
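
A rough back-of-the-envelope illustration of this point (my own, not from the paper): the ICL input must repeat every demonstration for each test instance, so its token count grows linearly with the number of demonstrations, while the fine-tuned model's input stays constant.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# One rendered demonstration and one test instance (illustrative pattern).
demo = ("A man is playing a guitar. Question: A person is making music. "
        "Yes or No? Answer: Yes\n")
test = "Two kids play soccer. Question: Children are playing a game. Yes or No? Answer:"

icl_input = demo * 16 + test  # ICL: all 16 demonstrations precede every test instance
ft_input = test               # FT: only the minimal pattern and the test instance

print(len(tokenizer(icl_input).input_ids), "tokens per ICL query")
print(len(tokenizer(ft_input).input_ids), "tokens per FT query")
```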

EXTENDS/USES

  • OPT (Zhang et al., 2022)

NOTES

  • For fine-tuning, the authors use pattern-based fine-tuning (PBFT), in which each training pair is wrapped in a textual pattern and its label is mapped to a verbalizer token, so the model is fine-tuned on the same input format used for ICL prompts (see the sketch after this list).

  • In some cases, OOD performance is better than in-domain performance :thinking:
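
A minimal sketch of the PBFT input format as I understand it; the pattern and verbalizer below are illustrative stand-ins, not the paper's exact templates.

```python
# Illustrative pattern and verbalizer (hypothetical, not the paper's exact ones).
pattern = "{premise} Question: {hypothesis} Yes or No? Answer:"
verbalizer = {"entailment": " Yes", "non-entailment": " No"}

def to_pbft_example(premise: str, hypothesis: str, label: str) -> str:
    """Render one labeled pair as a single LM training string; the model is
    fine-tuned to predict the verbalizer token that follows the pattern."""
    return pattern.format(premise=premise, hypothesis=hypothesis) + verbalizer[label]

print(to_pbft_example("A man plays guitar.", "A person makes music.", "entailment"))
```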

MY 2¢

  • "In-domain" generalization is another word for in-sample holdout. This is contrasted with "Out-of-domain" generalization.

References

  • Mosbach et al. (2023). Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation. Findings of ACL 2023.

  • Zhang et al. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv:2205.01068.