Paper Summary: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
A new strategy to induce reasoning capabilities (a la Chain-of-Thought, CoT) using pure Reinforcement Learning with Verifiable Rewards (RLVR), without Supervised Fine-Tuning (SFT).
Two models are published:
- DeepSeek-R1-Zero: Trained from DeepSeek-V3-Base¹ with RLVR only, no SFT.
- DeepSeek-R1: Extends the R1-Zero recipe with SFT and more RL. More specifically:
- A cold-start curated CoT dataset is used to kick-start the RL pipeline in the right direction (see Notes);
- The reward model now includes a language consistency term to reduce the instances of multi-language or otherwise cryptic CoT;
- The RL-tuned model is used to generate SFT data pairs;
- A few extra rounds of SFT are done;
- A few extra rounds of domain-specific RL tuning are done.
WHY
Because SFT is the most expensive part of traditional instruction-tuning strategies such as RLHF, we want to reduce it if possible
To see what's possible using only RL
HOW
For R1-Zero, RLVR is applied directly on a "base"¹ model, without any SFT. The "training data" for RL consists of math and coding problems that can be verified automatically (with a math solver or a compiler); the model is sampled to produce answers with some CoT-like helper text, and the reward is given by the verifier.
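To make the verifiable-reward idea concrete, here is a minimal sketch of what rule-based checkers could look like. This is my own illustration, not the paper's implementation; the function names, the answer-extraction regex, and the scoring are all assumptions.

```python
import re
import subprocess

def math_reward(model_output: str, ground_truth: str) -> float:
    """Toy rule-based reward: 1.0 if the final numeric answer in the output
    matches the known ground truth, 0.0 otherwise."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

def code_reward(generated_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Toy reward for code: fraction of (stdin, expected stdout) test cases
    that pass. A real setup would sandbox execution properly."""
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                ["python", "-c", generated_code],
                input=stdin_text, capture_output=True, text=True, timeout=5,
            )
            passed += int(result.stdout.strip() == expected.strip())
        except (subprocess.TimeoutExpired, OSError):
            pass
    return passed / len(test_cases) if test_cases else 0.0

print(math_reward("... so the result is <answer>42</answer>", "42"))  # 1.0
```

The key point is that the reward comes from a deterministic checker (compiler, solver, exact match) rather than a learned reward model, which makes the signal much harder to game.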
For R1, the authors introduce a more complex training pipeline with several additional steps, such as sampling the model to generate new data for SFT, bootstrapping from earlier checkpoints of the model itself, using previous models as ground-truth judges, heuristic data filters, etc.
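Below is a very rough, runnable sketch of how I read the stage ordering; every function here is a placeholder standing in for a full training job, not a real API.

```python
def sft(model: str, data: str) -> str:
    """Placeholder for a supervised fine-tuning run."""
    return f"SFT({model}, {data})"

def rl_verifiable(model: str, prompts: str, **flags) -> str:
    """Placeholder for a GRPO/RLVR run."""
    return f"RL({model}, {prompts})"

def rejection_sample(model: str, prompts: str) -> str:
    """Placeholder for sampling + filtering new SFT pairs from an RL-tuned model."""
    return f"samples({model})"

def r1_pipeline(base: str = "DeepSeek-V3-Base") -> str:
    # 1. Cold start: SFT on a small curated CoT dataset.
    m = sft(base, "cold-start CoT data")
    # 2. Reasoning-oriented RL with verifiable rewards + language-consistency term.
    m = rl_verifiable(m, "reasoning prompts", language_consistency=True)
    # 3. Use the RL-tuned model to generate new SFT pairs, filter them, mix in
    #    general-purpose data, then run more SFT (restarting from the base
    #    checkpoint, if I read the paper correctly).
    m = sft(base, rejection_sample(m, "reasoning prompts") + " + general SFT data")
    # 4. A final RL stage covering all scenarios, not just reasoning.
    m = rl_verifiable(m, "all scenarios")
    return m

print(r1_pipeline())
```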
CLAIMS/QUOTES
Reasoning emerges via RL: "... reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT."
Reasoning capabilities can be distilled too: "... reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models"
Performance vs OpenAI's o1: DeepSeek-R1 equals or exceeds the results of OpenAI's o1 model in several benchmarks.
Emergent Behavior: Several emergent behaviors were observed as the number of reasoning steps increases, such as "aha moments", self-reflection, and backtracking to earlier reasoning steps to change strategy.
EXTENDS/USES
- DeepSeek-V3-Base (DeepSeek AI, 2025)
- GRPO (Shao et al., 2024); see the advantage-computation sketch after this list
- Chain-of-Thought (Wei et al., 2022)
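The key trick in GRPO is that it drops the critic/value model and instead normalizes each sampled response's reward against the other responses in the same group. A minimal sketch of that advantage computation (my own illustration; e.g., whether sample or population standard deviation is used is a detail I'm glossing over):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its own sampling group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in group_rewards]

# Example: 4 sampled answers to the same prompt, two verified correct.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```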
NOTES
- Reward modeling: The reward used for R1-Zero relies on verifiable correctness checks such as: does the generated code compile? does the mathematical expression give the correct result? does the generated answer follow the specified format?
- This is now called RLVR: Reinforcement Learning with Verifiable Rewards
- Emergent Language Hybrids: The first model (R1-Zero) actually ended up using mixed languages (English, Chinese, math notation) in its reasoning steps, because this turned out to be more efficient.
- The new pipeline and the new model (R1) were created to mitigate this issue (among others)
- Unsuccessful attempts: The paper has a pretty interesting "Unsuccessful Attempts" section where the authors list what didn't work. I like this.
- Inducing Reasoning at training time: They used a prompt template that instructs the model to add reasoning steps within tags. A bit similar to Chain-of-Thought prompting, but applied at training time (rough template shape sketched below).
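For reference, the training prompt is roughly shaped like the string below. This is a paraphrase from memory, not a verbatim copy of the paper's template, but the <think>/<answer> tag structure is the important part.

```python
# Paraphrased shape of the R1-Zero prompt template (not the paper's exact wording).
REASONING_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question and "
    "the Assistant solves it. The Assistant first thinks about the reasoning "
    "process and then provides the answer. The reasoning process and answer "
    "are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

print(REASONING_TEMPLATE.format(question="What is 7 * 8?"))
```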
MY 2¢
¹ It's surprisingly hard to find a definition of what the "-Base" model variants are. I assume this refers to models with self-supervised pretraining only, but I couldn't find any definitive answer in the DeepSeek papers.

The new pipeline the authors created to go from R1-Zero to R1 contains a ton of interesting insights, but I'm not sure how reproducible it is. It adds quite a lot of complexity to the training flow, but apparently the results are good, so.