Paper Summary: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Figure: front page of "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" by Shao et al.

WHAT

A domain-specific 7B-parameter LLM is built to solve mathematical problems. A novel policy-gradient RL algorithm, Group Relative Policy Optimization (GRPO), is also introduced.

WHY

The objective was to increase performance in a specific domain (math) while reducing the resources needed to train a model for it.

HOW

  • DeepSeekMath training setup: a 7B model trained on a purpose-built, high-quality, curated math corpus. The base model is one fine-tuned for coding tasks (DeepSeek-Coder-Base-v1.5 7B). SFT is then done on English and Chinese mathematical instruction data. The final RL stage uses GRPO.

  • GRPO: GRPO drops the need to learn a value function in order to estimate the Advantage¹ (as is done in PPO). Instead, it estimates the Advantage from the rewards of multiple outputs sampled from the current policy for the same question, normalizing each reward by the group's mean and standard deviation (see the sketch below the figure).

Figure: GRPO vs. PPO training flow. In GRPO, there is no value model present to learn the value function.
Adapted from Shao et al., 2024.
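
To make the PPO/GRPO difference concrete, here is a minimal PyTorch sketch of the group-relative advantage and the clipped surrogate loss. This is my own illustration, not the paper's code: the function names and toy rewards are assumptions, and the actual algorithm works per token and adds a KL penalty toward a reference model.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: shape (G,) -- scalar rewards of G outputs sampled for ONE question.
    # GRPO's baseline is the group itself: normalize by the group mean and std
    # instead of subtracting a learned value-function estimate (as PPO does).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    # PPO-style clipped surrogate using the group-relative advantages above.
    # Kept at the whole-output level here for brevity.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with made-up numbers: 4 sampled answers, rewarded 1 if correct.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = grpo_advantages(rewards)
logp_old = torch.tensor([-12.0, -15.0, -14.0, -11.0])  # log-probs under the sampling policy
logp_new = logp_old.clone().requires_grad_(True)       # current policy (identical here, for the demo)
loss = grpo_clipped_loss(logp_new, logp_old, adv)
loss.backward()
```

The point is that the baseline comes from statistics of the sampled group itself, so no separate value model has to be trained or held in memory alongside the policy.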

CLAIMS

  • A smaller amount of high-quality, curated data beats a larger amount of lower-quality data.

  • The dataset they created is responsible for a good part of the model's overall performance.

  • Models fine-tuned with RL alone may surpass regular RLHF'd models.

  • The 7B versions of DeepSeekMath outperform all other open-source models from 7B up to 70B parameters.

QUOTES

  • Larger models vs better data "Our pre-trained base model DeepSeekMath-Base 7B achieves comparable performance with Minerva 540B ... A smaller model pre-trained on high-quality data could achieve strong performance as well."

  • Using a LLM fine-tuned for coding as base

    • "... starting from a code training model is a better choice compared to a general LLM."
    • "... code training improves models’ ability to do mathematical reasoning ..."

EXTENDS/USES

  • PPO (Schulman et al., 2017)

  • Chain of Thought (Wei et al., 2022)

NOTES

  • The data preparation step was quite complex and involved. They trained an auxiliary fastText classifier on a small, pristine seed dataset just to decide which Common Crawl webpages should be part of the full training corpus (a rough sketch follows below).
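
As a rough illustration of that kind of classifier-based filtering (not the paper's actual pipeline; the file name, labels, and threshold below are assumptions), a fastText classifier trained on labeled seed pages can be used to score crawled pages:

```python
import fasttext

# seed.txt: one page of text per line, prefixed with __label__math or __label__other
# (hypothetical file; the paper seeds its classifier with a high-quality math corpus).
model = fasttext.train_supervised(input="seed.txt")

def keep_page(page_text: str, threshold: float = 0.5) -> bool:
    # Keep a crawled page only if the classifier is confident enough it is math-related.
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold
```

The paper then iterates: pages recalled this way enrich the seed set, the classifier is retrained, and the cycle repeats until additional rounds bring in little new data.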

References

  • Shao et al., 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.

  • Schulman et al., 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.

  • Wei et al., 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.

Footnotes

1: The Advantage measures how much better a sampled action (here, a generated answer) performed than a baseline of expected performance. In PPO the baseline is a learned value function; in GRPO it is the mean reward of the group of sampled outputs. The Advantage is not itself minimized; it weights the policy-gradient update.