Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia is a framework for uniformly training variations of LLMs to measure the impact of hyperparameter choices:
- number of layers
- model dimensionality
- number of attention heads
- dimensionality of attention heads
- batch size
- learning rate
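The varied dimensions above can be captured in a small config sketch. The values below are illustrative scale points, not the official Pythia settings; note that head dimensionality is derived from the other two attention choices:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int        # number of layers
    d_model: int         # model dimensionality
    n_heads: int         # number of attention heads
    batch_size: int      # sequences per optimizer step
    learning_rate: float

    @property
    def d_head(self) -> int:
        # dimensionality of attention heads: d_model split evenly across heads
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

# Two hypothetical scale points (not actual Pythia configs)
small = ModelConfig(n_layers=6,  d_model=512,  n_heads=8,  batch_size=256, learning_rate=1e-3)
large = ModelConfig(n_layers=24, d_model=1024, n_heads=16, batch_size=512, learning_rate=3e-4)
print(small.d_head, large.d_head)  # 64 64
```

Holding everything else fixed while sweeping one field of such a config is what makes cross-model comparisons meaningful.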
It's hard to measure the impact of hyperparameters using other published LLMs because they were trained with different architectures, different data, and different training decisions.
- Train 8 variations of GPT-3-like models and study the impact of changing hyperparameters on model performance, evaluated on several NLP tasks (via EleutherAI/lm-evaluation-harness)
Deduplicating the training dataset (The Pile) had no measurable benefit on performance, contrary to existing literature.
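For intuition, corpus deduplication can be sketched as hash-based exact matching. This is a simplified stand-in; production pipelines typically also use near-duplicate detection such as MinHash:

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates, keeping the first occurrence (order-preserving)."""
    seen, unique = set(), []
    for doc in docs:
        # hash whitespace-normalized text so trivial spacing differences collapse
        h = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["the pile", "the  pile", "pythia suite"]
print(deduplicate(docs))  # ['the pile', 'pythia suite']
```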
Using parallel attention and MLP sublayers did not degrade performance, contrary to existing literature.
Using multi-lingual datasets hurt performance less than expected.
The position of a piece of text in the training dataset (i.e., at the start or the end) does not make it more or less likely to be memorized by the model.
Term frequencies in the pretraining dataset do affect the downstream performance of the model, especially in models with higher capacity.
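A rough way to probe such a relationship is to count how often task-relevant terms appear in the pretraining corpus and compare those counts against per-example accuracy. The sketch below is a hypothetical analysis helper, not the paper's exact methodology:

```python
from collections import Counter

def term_frequencies(corpus, terms):
    """Count occurrences of each tracked term across a corpus of documents."""
    counts = Counter()
    for doc in corpus:
        for tok in doc.lower().split():
            if tok in terms:
                counts[tok] += 1
    return counts

corpus = ["Paris is the capital of France", "France borders Spain"]
freqs = term_frequencies(corpus, {"france", "paris", "spain"})
print(freqs)  # Counter({'france': 2, 'paris': 1, 'spain': 1})
```

Correlating such frequencies with task accuracy is how one would check whether rarely-seen terms are the ones larger models answer correctly and smaller models miss.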
- Toolset from GPT-NeoX
- GPT-3 for architecture and most other decisions
- EleutherAI's The Pile dataset
- BPE tokenizer (from GPT-NeoX-20B)
- Flash Attention (Dao et al., 2022)
- Rotary Embeddings (Su et al., 2021)
- Parallel Attention (from GPT-J-6B)
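The parallel formulation computes the attention and MLP sublayers from the same layer input and sums their outputs, instead of feeding the attention-updated stream into the MLP. With linear stand-ins for the two sublayers, the two residual patterns look like this (an illustrative sketch, not the GPT-J implementation; real blocks include layer norms and nonlinearities):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_attn = rng.normal(size=(d, d))  # stand-in for the attention sublayer
W_mlp = rng.normal(size=(d, d))   # stand-in for the MLP sublayer

def attn(x): return x @ W_attn
def mlp(x):  return x @ W_mlp

def sequential_block(x):
    # standard GPT block: the MLP consumes the attention-updated residual stream
    x = x + attn(x)
    x = x + mlp(x)
    return x

def parallel_block(x):
    # GPT-J-style block: both sublayers read the same input,
    # so their matmuls can be computed concurrently
    return x + attn(x) + mlp(x)

x = rng.normal(size=(4, d))
print(sequential_block(x).shape, parallel_block(x).shape)  # (4, 8) (4, 8)
```

The parallel form drops the `mlp(attn(x))` cross term, which is why it was expected to hurt quality; the finding above is that in practice it did not.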
- Authors applied controlled interventions :thinking: to the pretraining data to study and mitigate gender bias
- Most new models are trained on a curated English-language dataset called The Pile