Paper Summary: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
A new strategy to train reasoning capabilities (a la Chain-of-Thought, CoT) using pure Reinforcement Learning (RL), without Supervised Fine-Tuning (SFT).
Two models are published:
DeepSeek-R1-Zero: A reasoning (CoT) model based on DeepSeek-V3-Base [1]. Uses RL only; no SFT.
DeepSeek-R1: Similar to R1-Zero, with the following extra steps:
- A curated cold-start CoT dataset is used to kick-start the RL pipeline in the right direction (see Notes);
- The reward now includes a language-consistency term to reduce instances of multi-language or otherwise cryptic CoT (a toy sketch of such a term follows this list);
- The RL-tuned model is used to generate SFT data pairs;
- A few extra rounds of SFT are done;
- A few extra rounds of domain-specific RL-tuning are done.
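To make the language-consistency term concrete, here is a toy sketch. The paper describes the reward roughly as the proportion of target-language words in the CoT; the tokenization and the ASCII/CJK language test below are my own crude stand-ins, not the authors' implementation.

```python
import re

def language_consistency_reward(cot_text: str, target_lang: str = "en") -> float:
    """Toy proxy: fraction of whitespace-separated tokens in the CoT that
    look like the target language. The ASCII/CJK checks below are my own
    crude stand-ins, not the authors' implementation."""
    tokens = re.findall(r"\S+", cot_text)
    if not tokens:
        return 0.0
    if target_lang == "en":
        # Treat pure-ASCII tokens as "target language" words.
        in_target = [t for t in tokens if all(ord(c) < 128 for c in t)]
    else:
        # e.g. "zh": count tokens containing CJK characters.
        in_target = [t for t in tokens if re.search(r"[\u4e00-\u9fff]", t)]
    return len(in_target) / len(tokens)

# A mixed English/Chinese chain of thought scores lower than a pure-English one.
print(language_consistency_reward("First compute 3 * 4, 然后 add 5."))  # 0.875
print(language_consistency_reward("First compute 3 * 4, then add 5."))  # 1.0
```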
WHY
Because SFT is the most expensive part of traditional instruction-tuning strategies such as RLHF.
HOW
RL-based fine-tuning is applied directly to a "base" [1] model, without any SFT.
The authors introduce a complex training pipeline with several additional steps, such as sampling the RL-tuned model to generate new SFT data, bootstrapping from earlier checkpoints of the model itself, using previous models as a ground-truth judge, heuristics-based data filters, etc.
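The RL algorithm is GRPO (listed under EXTENDS/USES below). Its core idea is to drop the learned critic/value model: for each prompt, a group of outputs is sampled, each is scored with the (largely rule-based) reward, and each output's advantage is its reward normalized against the rest of the group. The snippet below is a minimal sketch of just that normalization step under my reading; the full objective also has a PPO-style clipped ratio and a KL penalty, which are omitted here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO's core trick (as I understand it): normalize each sampled output's
    reward against the group sampled for the same prompt, instead of training
    a separate critic/value model."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled answers to one prompt; only the third one is correct.
rewards = [0.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# -> approximately [-0.58, -0.58, 1.73, -0.58]; the correct sample gets pushed up.
```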
CLAIMS/QUOTES
No need for SFT: "... reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT."
Reasoning capabilities can be distilled too: "... reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models" (a sketch of what this means in practice follows this list of claims).
Performance vs OpenAI's o1: DeepSeek-R1 equals or exceeds the results of OpenAI's o1 model in several benchmarks.
Emergent Behavior: Several emergent behaviors were observed as the reasoning (test-time computation) grows longer, such as "aha moments", self-reflection, and tracing back to previous reasoning steps to change strategy.
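On the distillation claim: as far as I can tell, "distillation" here just means supervised fine-tuning a smaller model on reasoning traces sampled from DeepSeek-R1, not logit matching. Below is a toy sketch of what a single training record might look like; the field names and the <think>/<answer> tags are my own illustration, loosely inspired by the paper's output template, not an exact schema.

```python
# Toy sketch of a distillation record: the smaller model is simply SFT'd on
# reasoning traces sampled from the larger RL-trained model. Field names and
# the <think>/<answer> tags are my own illustration, not the paper's schema.
def build_distillation_example(question: str, teacher_cot: str, teacher_answer: str) -> dict:
    return {
        "prompt": question,
        "completion": f"<think>\n{teacher_cot}\n</think>\n<answer>{teacher_answer}</answer>",
    }

example = build_distillation_example(
    question="What is 17 * 24?",
    teacher_cot="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    teacher_answer="408",
)
print(example["completion"])
```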
EXTENDS/USES
- DeepSeek-V3-Base (DeepSeek-AI, 2024)
- GRPO (Shao et al., 2024)
- Chain-of-Thought prompting (Wei et al., 2022)
NOTES
Reward modeling: The reward doesn't just take into account how factually correct the answer is, but also things such as: does the generated code compile? Does the answer follow the specified format? (A toy sketch follows.)
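Here is what such a rule-based reward could look like for a single response: an accuracy term (does the extracted final answer match the reference?) plus a format term (is the reasoning wrapped in the expected tags?). The tag names, the extraction regexes, and the 0.2 weight are my own assumptions; for code tasks the accuracy check would instead compile and run the program against test cases.

```python
import re

def rule_based_reward(output: str, reference_answer: str) -> float:
    """Toy rule-based reward in the spirit of the paper: no neural reward
    model, just programmatic checks. Tags, regexes, and the 0.2 weight are
    my own assumptions."""
    # Format term: reasoning and answer must appear inside the expected tags.
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", output, re.S))
    # Accuracy term: compare the extracted final answer to the reference.
    match = re.search(r"<answer>(.*?)</answer>", output, re.S)
    answer_ok = match is not None and match.group(1).strip() == reference_answer.strip()
    return 1.0 * answer_ok + 0.2 * format_ok

out = "<think>2 + 2 * 5 = 2 + 10 = 12</think>\n<answer>12</answer>"
print(rule_based_reward(out, "12"))  # 1.2
```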
Emergent language hybrids: The first model (R1-Zero) actually ended up mixing languages (English, Chinese, math notation) in its reasoning steps, presumably because that was more efficient for it.
- The new pipeline and the new model (R1) were created to mitigate this issue (among others).
Unsuccessful attempts: The paper has a pretty interesting Unsuccessful Attempts section where the authors list what didn't work. I like this.
MY 2¢
[1]: It's surprisingly hard to find a definition of what the "-Base" model variants are. I assume this refers to models with self-supervised pretraining only, but I couldn't find a definitive answer in the DeepSeek papers.

The new pipeline the authors created to go from R1-Zero to R1 contains a ton of interesting insights, but I'm not sure how reproducible it is. It adds quite a lot of complexity to the training flow, but apparently the results are good, so there's that.