Paper Summary: LLaMA: Open and Efficient Foundation Language Models

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

A family of LLMs (LLaMA, 7B to 65B parameters) is trained from scratch using more data but far fewer parameters than GPT-3. Only publicly available data is used.

WHY

To test how the trade-off between training data and compute budget behaves as scale grows: for a fixed inference budget, can a smaller model trained on more tokens match or beat a larger one?

HOW

LLaMA is a standard decoder-only Transformer LM with a handful of architectural improvements borrowed from previous LMs (see EXTENDS/USES). It is trained exclusively on open-access data.

CLAIMS

  • Models with fewer parameters are cheaper to run at inference time (see the back-of-envelope sketch after this list)

  • LLaMA outperforms or matches LMs with 3-10x as many parameters (GPT-3, Gopher, Chinchilla) on most natural language tasks, in both zero-shot and few-shot settings
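
The first claim is essentially a per-token FLOP count. A back-of-envelope sketch, using the common approximation of roughly 2 FLOPs per parameter per generated token (my assumption, not a figure from the paper), shows why a 13B model is about an order of magnitude cheaper to serve than GPT-3 175B:

```python
# Rough per-token inference cost, using the ~2 * N FLOPs/token rule of thumb.
# This ignores the attention-over-context term, so it is only an approximation.
def flops_per_token(n_params: float) -> float:
    return 2.0 * n_params

llama_13b = flops_per_token(13e9)    # ~2.6e10 FLOPs per token
gpt3_175b = flops_per_token(175e9)   # ~3.5e11 FLOPs per token
print(f"GPT-3 175B costs ~{gpt3_175b / llama_13b:.1f}x more per token than LLaMA-13B")
```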

EXTENDS/USES

  • AdamW optimizer
  • Efficient causal multi-head attention from facebookresearch/xformers
  • Pre-normalization with RMSNorm (pre-norm as in GPT-3)
  • SwiGLU activation function (as in PaLM)
  • Rotary embeddings (RoPE, as in GPTNeo); a minimal sketch of these components follows the list
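
A minimal PyTorch sketch of the borrowed components (RMSNorm pre-normalization, a SwiGLU feed-forward, and rotary position embeddings). Class names, dimensions, and layout here are my own illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Scale activations by their reciprocal RMS, then by a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, seq, heads, head_dim)
    tensor; typically applied to queries and keys before attention."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-dimension frequencies and per-position angles.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

RMSNorm drops the mean-centering and bias terms of LayerNorm, which keeps pre-normalization cheap; SwiGLU gates the feed-forward with a second linear projection; RoPE encodes positions by rotating query/key dimension pairs instead of adding learned position vectors.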

NOTES

  • Total number of tokens used for training: 1.4T for the 33B/65B models (1.0T for the 7B/13B models)

  • A brief instruction-tuning experiment (LLaMA-I) was also done, using simple supervised fine-tuning (SFT)
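
To put the 1.4T tokens in perspective, here is a rough training-compute estimate using the standard C ≈ 6·N·D approximation (a scaling-law rule of thumb I am adding, not a figure from the paper):

```python
# Rough training-compute estimate with the C ~= 6 * N * D approximation
# (N = parameters, D = training tokens). The 65B and 1.4T figures come from
# the paper; the formula itself is a rule of thumb, not something it derives.
n_params = 65e9      # LLaMA-65B
n_tokens = 1.4e12    # total training tokens
train_flops = 6 * n_params * n_tokens
print(f"~{train_flops:.1e} FLOPs")   # ~5.5e+23 FLOPs
```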

MY 2¢

  • This is primarily an engineering paper; there aren't many theoretical advancements.

  • The moat enjoyed by big players gets smaller every day.

