Paper Summary: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (Biderman et al., 2023)
Source

WHAT

A framework, Pythia, to uniformly train variations of LLMs in order to measure the impact of hyperparameter choices (a config sketch follows the list below):

  • number of layers
  • model dimensionality
  • number of attention heads
  • dimensionality of attention heads
  • batch size
  • learning rate

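To make these knobs concrete, here is a minimal sketch of how such settings might appear in a GPT-NeoX-style training configuration. The values are illustrative placeholders, not the actual Pythia settings; the real per-model configurations are listed in the paper and in the EleutherAI/pythia repository.

```python
# Illustrative placeholder config, NOT the actual Pythia settings.
example_model_config = {
    "num_layers": 12,           # number of layers
    "hidden_size": 768,         # model dimensionality
    "num_attention_heads": 12,  # number of attention heads
    "head_dim": 768 // 12,      # dimensionality of each attention head
    "train_batch_size": 1024,   # sequences per optimizer step
    "learning_rate": 6e-4,      # peak learning rate (placeholder value)
}
```
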
WHY

It's hard to measure the impact of hyperparameters using other published LLMs because they have been trained using different architectures, different data, and different training decisions.

HOW

  • Train 8 sizes of GPT-3-like models (70M to 12B parameters), each on both the original and a deduplicated copy of The Pile, saving intermediate checkpoints throughout training; study the impact of the hyperparameter choices on model performance, as evaluated on several NLP tasks via EleutherAI/lm-evaluation-harness (a loading sketch follows below)

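Since the suite's main artifact is a set of public models with intermediate checkpoints, here is a minimal sketch of loading one Pythia checkpoint from the Hugging Face Hub with transformers. The model size and step revision are just examples; checkpoints are exposed as git revisions named after the training step.

```python
# Load an intermediate Pythia checkpoint (example model size and step).
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)

inputs = tokenizer("The Pile is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

The same checkpoints can then be scored with EleutherAI/lm-evaluation-harness, which is the evaluation route the paper uses.
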
CLAIMS

  • Deduplicating The Pile training dataset had no benefit on performance, contrary to existing literature.

  • Using parallel attention and MLP sublayers did not degrade performance, contrary to existing literature (see the sketch after this list).

  • Using multi-lingual datasets hurt performance less than expected.

  • The position of a piece of text in the training data (i.e., whether it is seen early or late in training) does not make it more or less likely to be memorized by the model.

  • Term frequencies in the pretraining dataset do affect the downstream performance of the model, especially in models with higher capacity.

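On the parallel-sublayer claim above: the sketch below contrasts a standard sequential transformer block with the GPT-J-style parallel formulation, in which attention and MLP read the same normalized input and their outputs are summed. The modules are simplified stand-ins (no causal mask, no dropout), not the actual Pythia implementation.

```python
# Sequential vs. parallel transformer blocks (simplified; no causal mask).
import torch
import torch.nn as nn

class SequentialBlock(nn.Module):
    """GPT-2/GPT-3-style block: the MLP runs after the attention output."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

class ParallelBlock(nn.Module):
    """GPT-J-style block: attention and MLP share the input and are summed."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln(x)
        return x + self.attn(h, h, h, need_weights=False)[0] + self.mlp(h)

x = torch.randn(2, 16, 64)            # (batch, sequence, d_model)
print(ParallelBlock(64, 8)(x).shape)  # torch.Size([2, 16, 64])
```
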
EXTENDS/USES

  • Toolset from GPT-NeoX
  • GPT-3 for architecture and most other decisions
  • EleutherAI's The Pile dataset
  • BPE tokenizer (from GPT-NeoX-20B)
  • Flash Attention (Dao et al., 2022)
  • Rotary Embeddings (Su et al., 2021); see the sketch after this list
  • Parallel Attention (from GPT-J-6B)

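The rotary embeddings item above refers to RoPE (Su et al., 2021), which rotates each pair of query/key feature dimensions by a position-dependent angle instead of adding a position vector. Below is a minimal NumPy sketch of the idea; the actual GPT-NeoX/Pythia implementation applies the rotation inside the attention computation and may rotate only a fraction of the dimensions.

```python
# Minimal RoPE sketch: rotate feature pairs by position-dependent angles.
import numpy as np

def rotary_embed(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)  # one frequency per pair
    angles = np.outer(np.arange(seq_len), freqs)    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]               # paired dimensions
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 64)   # (seq_len, head_dim)
print(rotary_embed(q).shape)  # (16, 64)
```
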
NOTES

  • Authors applied counterfactual interventions to the pretraining data (e.g., swapping gendered pronouns during part of training) to study how data bias affects the trained model (sketch below)
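
A minimal sketch of the kind of counterfactual pronoun intervention described in that case study: swap morphologically masculine English pronouns for feminine ones in a slice of the pretraining text. The word list and string-level replacement below are simplifications, not the authors' exact procedure.

```python
# Simplified pronoun-swapping intervention (illustrative, not the paper's code).
import re

PRONOUN_SWAP = {"he": "she", "him": "her", "his": "her", "himself": "herself"}

def swap_masculine_pronouns(text: str) -> str:
    """Replace masculine pronouns with feminine ones, preserving capitalization."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = PRONOUN_SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    pattern = r"\b(" + "|".join(PRONOUN_SWAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(swap_masculine_pronouns("He said his code was ready."))
# -> "She said her code was ready."
```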

MY 2¢

  • Many recent papers train on The Pile, EleutherAI's curated, predominantly English-language dataset

References

  • Biderman, S., Schoelkopf, H., Anthony, Q., et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.
  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
  • Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
