Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia is a framework for uniformly training variations of LLMs to measure the impact of hyperparameter choices:
- number of layers
- model dimensionality
- number of attention heads
- dimensionality of attention heads
- batch size
- learning rate
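The varied dimensions above can be captured in a small config sketch. The values below are illustrative scale points, not the official Pythia settings; note that head dimensionality is derived from the other two attention choices:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int        # number of layers
    d_model: int         # model dimensionality
    n_heads: int         # number of attention heads
    batch_size: int      # sequences per optimizer step
    learning_rate: float

    @property
    def d_head(self) -> int:
        # dimensionality of attention heads: d_model split evenly across heads
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

# Two hypothetical scale points (not actual Pythia configs)
small = ModelConfig(n_layers=6,  d_model=512,  n_heads=8,  batch_size=256, learning_rate=1e-3)
large = ModelConfig(n_layers=24, d_model=1024, n_heads=16, batch_size=512, learning_rate=3e-4)
print(small.d_head, large.d_head)  # 64 64
```

Holding everything else fixed while sweeping one field of such a config is what makes cross-model comparisons meaningful.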
It's hard to measure the impact of hyperparameters using other published LLMs because they were trained with different architectures, different data, and different training decisions.
- Train 8 variations of GPT-3-like models and study the impact of changing hyperparameters on model performance, evaluated on several NLP tasks (via EleutherAI/lm-evaluation-harness)
Deduplicating the training dataset (The Pile) had no measurable benefit on performance, contrary to existing literature.
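For intuition, corpus deduplication can be sketched as hash-based exact matching. This is a simplified stand-in; production pipelines typically also use near-duplicate detection such as MinHash:

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates, keeping the first occurrence (order-preserving)."""
    seen, unique = set(), []
    for doc in docs:
        # hash whitespace-normalized text so trivial spacing differences collapse
        h = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["the pile", "the  pile", "pythia suite"]
print(deduplicate(docs))  # ['the pile', 'pythia suite']
```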
Using parallel attention and MLP sublayers did not degrade performance, contrary to existing literature.
Using multi-lingual datasets hurt performance less than expected.
The position of a piece of text in the training dataset (i.e., at the start or the end) does not make it more or less likely to be memorized by the model.
Term frequencies in the pretraining dataset do affect the downstream performance of the model, especially in models with higher capacity.
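A rough way to probe such a relationship is to count how often task-relevant terms appear in the pretraining corpus and compare those counts against per-example accuracy. The sketch below is a hypothetical analysis helper, not the paper's exact methodology:

```python
from collections import Counter

def term_frequencies(corpus, terms):
    """Count occurrences of each tracked term across a corpus of documents."""
    counts = Counter()
    for doc in corpus:
        for tok in doc.lower().split():
            if tok in terms:
                counts[tok] += 1
    return counts

corpus = ["Paris is the capital of France", "France borders Spain"]
freqs = term_frequencies(corpus, {"france", "paris", "spain"})
print(freqs)  # Counter({'france': 2, 'paris': 1, 'spain': 1})
```

Correlating such frequencies with task accuracy is how one would check whether rarely-seen terms are the ones larger models answer correctly and smaller models miss.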
- Toolset from GPT-NeoX
- GPT-3 for architecture and most other decisions
- EleutherAI's The Pile dataset
- BPE tokenizer (from GPT-NeoX-20B)
- Flash Attention (Dao et al., 2022)
- Rotary Embeddings (Su et al., 2021)
- Parallel Attention (from GPT-J-6B)
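The parallel formulation computes the attention and MLP sublayers from the same layer input and sums their outputs, instead of feeding the attention-updated stream into the MLP. With linear stand-ins for the two sublayers, the two residual patterns look like this (an illustrative sketch, not the GPT-J implementation; real blocks include layer norms and nonlinearities):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_attn = rng.normal(size=(d, d))  # stand-in for the attention sublayer
W_mlp = rng.normal(size=(d, d))   # stand-in for the MLP sublayer

def attn(x): return x @ W_attn
def mlp(x):  return x @ W_mlp

def sequential_block(x):
    # standard GPT block: the MLP consumes the attention-updated residual stream
    x = x + attn(x)
    x = x + mlp(x)
    return x

def parallel_block(x):
    # GPT-J-style block: both sublayers read the same input,
    # so their matmuls can be computed concurrently
    return x + attn(x) + mlp(x)

x = rng.normal(size=(4, d))
print(sequential_block(x).shape, parallel_block(x).shape)  # (4, 8) (4, 8)
```

The parallel form drops the `mlp(attn(x))` cross term, which is why it was expected to hurt quality; the finding above is that in practice it did not.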
- Authors applied controlled interventions :thinking: to the pretraining data to study and mitigate gender bias
- Most new models are trained on a curated English-language dataset called The Pile