Paper Summary: LLaMA: Open and Efficient Foundation Language Models
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
An LLM family (LLaMA, 7B to 65B parameters) is trained from scratch using more training data but far fewer parameters than GPT-3. Only publicly available data is used.
WHY
To test how the tradeoff between training data and model size behaves as scale grows: for a given inference budget, a smaller model trained on more tokens can be preferable to a larger one.
HOW
LLaMA is a standard Transformer LLM with some optimizations used by previous LMs. It's trained exclusively on open-access data.
CLAIMS
Models with fewer parameters are cheaper and faster to run at inference time.
LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being over 10x smaller, and LLaMA-65B is competitive with Gopher (280B), Chinchilla-70B, and PaLM-540B (zero-shot and few-shot).
EXTENDS/USES
- AdamW Optimizer
- Efficient causal multi-head attention implementation from facebookresearch/xformers
- RMSNorm
- SwiGLU Activation Function
- Rotary Embeddings (RoPE) from GPTNeo (see the sketch after this list)
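
As a quick reference for myself, here is a minimal PyTorch sketch (my own, not the official LLaMA code) of the three architectural tweaks listed above: RMSNorm, a SwiGLU feed-forward block, and rotary position embeddings. Class and function names (RMSNorm, SwiGLUFeedForward, apply_rope) and all dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the activations (no mean-centering)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """FFN with a SwiGLU gate: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def apply_rope(x, theta: float = 10000.0):
    """Rotate query/key pairs by position-dependent angles (RoPE).
    x: (batch, seq_len, n_heads, head_dim) with an even head_dim."""
    b, t, h, d = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack(
        [x1 * cos[None, :, None, :] - x2 * sin[None, :, None, :],
         x1 * sin[None, :, None, :] + x2 * cos[None, :, None, :]], dim=-1)
    return rotated.flatten(-2)
```

In a LLaMA-style block, RMSNorm is applied to the input of each sub-layer (pre-normalization), the SwiGLU block replaces the usual ReLU FFN, and apply_rope is applied to queries and keys instead of adding absolute positional embeddings.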
NOTES
Total number of tokens used for training: 1.4T for the 33B and 65B models (1.0T for the 7B and 13B models)
Some instruction fine-tuning (the LLaMA-I variant) was done using simple supervised fine-tuning (SFT); a sketch follows below
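
To remind myself what "simple SFT" means in practice, here is a minimal sketch of one supervised fine-tuning step. It assumes a Hugging Face-style causal LM (the forward call returns an object with `.logits`) and pre-tokenized batches of instruction/response pairs with prompt tokens masked to -100; names and hyperparameters are my own placeholders, not the paper's recipe.

```python
import torch
from torch.optim import AdamW

def sft_step(model, batch, optimizer):
    """One gradient step of next-token prediction on instruction/response pairs."""
    logits = model(batch["input_ids"]).logits  # (B, T, vocab); assumes HF-style output
    # Shift so each position predicts the next token; prompt tokens are
    # excluded from the loss by labeling them -100 when the batch is built.
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (placeholder hyperparameters, not the paper's):
# optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
# for batch in batches:
#     sft_step(model, batch, optimizer)
```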
MY 2¢
This is an engineering paper; it offers few theoretical advancements.
The moat enjoyed by big players gets smaller every day.