Over the last week I’ve been going deep into DeepSeek to better understand how they built the R1 model, and what made their process so different from everyone else’s. Along the way I’ve found a few people who have really helped improve my understanding of the novel things DeepSeek has been doing, and without a doubt Alexandr Wang, the CEO of Scale, has been the most helpful.
On that note, I want to start with this tweet from Alexandr that I think does such a good job of breaking down what makes the training process of DeepSeek so unique.
Before we dive in, there are two key terms you should know. I’m sure many of you know them already, but I need to cover them anyway, otherwise nothing I talk about below will make any sense 🙃
SFT - supervised fine-tuning. SFT is a way to take an existing model and fine-tune it using a labeled dataset. Think of your brain as a super powerful reasoning model: I might use SFT by showing you a bunch of questions about Star Trek along with the answers to those questions. Now you have some additional specialized knowledge and could pass as a Trekkie 🖖
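To make that a little more concrete, here’s a toy sketch of what a labeled SFT dataset looks like. The examples are mine (and Star Trek themed, naturally), not anything from DeepSeek; a real SFT dataset is just a much, much bigger pile of these prompt/answer pairs:

```python
# A toy SFT dataset: each example pairs a prompt with the exact answer
# we want the model to learn to produce (the "label").
sft_dataset = [
    {"prompt": "Who captained the Enterprise in Star Trek: The Next Generation?",
     "answer": "Jean-Luc Picard"},
    {"prompt": "What does the Vulcan salute mean?",
     "answer": "Live long and prosper."},
]

# Fine-tuning then just means nudging the model's weights so that, given
# each prompt, it assigns high probability to the labeled answer.
for example in sft_dataset:
    print(f"Q: {example['prompt']}\nA: {example['answer']}\n")
```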
RLHF - reinforcement learning from human feedback. This is pretty much exactly what it sounds like: unlike SFT, which involves training a model on a labeled dataset, RLHF is all about training a model with human feedback. The concept hinges on humans giving feedback on model outputs, and using that feedback to build a reward system the model can then use to improve 🏆
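And here’s the same kind of toy sketch for RLHF. Humans compare two model answers, those comparisons train a reward model, and the LLM is then optimized to score well on it. The example answers and the scoring rule below are purely illustrative stand-ins of my own, not anything a real system uses:

```python
# Toy RLHF preference data: for each prompt, a human compared two model
# answers and marked which one they preferred. A reward model is trained
# on these comparisons, and the LLM is then optimized to score well on it.
preferences = [
    {
        "prompt": "Explain warp drive to a 5-year-old.",
        "chosen": "It's a magic engine that squeezes space so the ship can go really fast.",
        "rejected": "Warp drive utilizes a subspace field differential generated by "
                    "matter-antimatter annihilation regulated through dilithium crystals "
                    "to distort local spacetime geometry.",
    },
]

def toy_reward(answer: str) -> float:
    # Stand-in for a learned reward model: this one just prefers shorter,
    # simpler answers. A real reward model is a neural network trained on
    # the human comparisons above.
    return 1.0 / (1.0 + len(answer.split()))

for p in preferences:
    # True: the toy reward agrees with the human preference
    print(toy_reward(p["chosen"]) > toy_reward(p["rejected"]))
```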
Okay, SFT and RLHF are now under your belt, so let’s dive in.
At a high level, what Alexandr is saying here is pretty powerful, and the Tesla example he gives is a good one. DeepSeek places a ton of emphasis on data annotation. If you want to know why Tesla is so far ahead of every other car company when it comes to self-driving, data annotation from humans is at the core of that advantage. It’s also how Waymo was able to launch the first fully autonomous taxi service in San Francisco.
I can still remember seeing Waymos driving around SF for years with people sitting in them as they drove all over the city. After a while I thought - why the heck can’t these things drive themselves yet? And the answer is, they needed more data, from humans.
Now let’s go a step further, and for this concept I’m going to reference another tweet here that really gets to the core of how DeepSeek is able to leverage this human data in such an interesting way.
What Chamath describes here is what makes R1 truly unique - the model essentially had a starting point (and a very good one), but then it continued to get better on its own. At the core of this is DeepSeek’s novel approach to reward modeling.
I’ll pull a specific quote out of this tweet because I think it might just be the nugget of nuggets:
Rather than using complex neural reward models that can lead to reward hacking, they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking)
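Here’s a rough sketch of what that kind of rule-based reward could look like in code. I’m guessing at the specifics (the answer-tag matching and the point values below are illustrative stand-ins, not DeepSeek’s actual implementation), but the idea of checking the final answer and checking the structure, with no neural reward model in sight, comes through:

```python
import re

def accuracy_reward(model_output: str, ground_truth: str) -> float:
    # Accuracy reward: did the model's final answer match the known
    # correct answer? For math-style problems this can be a simple
    # string/number comparison - no neural reward model needed.
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return 0.0

def format_reward(model_output: str) -> float:
    # Format reward: did the model show its reasoning in the expected
    # structure (thinking first, then a final answer)?
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", model_output, re.DOTALL)
    return 0.5 if ok else 0.0

output = "<think>2 + 2 is 4 because...</think><answer>4</answer>"
print(accuracy_reward(output, "4") + format_reward(output))  # 1.5
```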
So let’s go deeper on this one, because I think it gets to the core of why DeepSeek is as good as it is, and how it got that good at a fraction of the cost. Now we’re going to talk about an algorithm - a breakthrough algorithm. It’s shown below, and don’t worry if you don’t understand it in detail; I’m certainly not going to pretend that I do.
The image above shows the high-level logic for an algorithm called Iterative Group Relative Policy Optimization, but that’s kinda long so we’ll do what everyone else does and call it GRPO.
GRPO is a reinforcement learning algorithm designed to make LLMs better at reasoning. So yes, as the creator of the Reasoning Models Substack you can bet I’m a big GRPO fan…but I’m not alone here.
What makes GRPO so special is that it dramatically reduces both memory and compute requirements compared to standard PPO-style reinforcement learning:
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm to improve the reasoning capabilities of LLMs. It was introduced in the DeepSeekMath paper in the context of mathematical reasoning. GRPO modifies the traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates baselines from group scores, reducing memory usage and computational overhead. GRPO, now also used by the Qwen team, can be used with rule/binary-based Rewards as well as General Reward Models to improve models on helpfulness. (Source - https://www.philschmid.de/deepseek-r1)
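The core trick is easier to see in code than in the screenshot of the algorithm. For a single prompt, GRPO samples a group of answers, scores each one, and then uses the group’s own statistics as the baseline instead of a separate value model. Here’s a minimal sketch of that advantage calculation (the reward numbers are made up):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO's baseline comes from the group itself: each sampled answer
    # is scored relative to its siblings, so no separate value (critic)
    # model has to be trained or kept in memory.
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Say we sampled 4 answers to the same prompt and scored them with the
# rule-based rewards: two were right and well formatted, two were not.
rewards = [1.5, 0.0, 1.5, 0.5]
print(group_relative_advantages(rewards))
```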
To build R1, DeepSeek essentially started with V3 as their base model (V3 is just an existing LLM of theirs), and then used GRPO to massively optimize the model. This is done through two mechanisms: accuracy rewards and format rewards. I won’t go much deeper into these two because I’m trying to keep this readable for a broad(ish) audience…but just know that DeepSeek is optimizing a model, and using two different reward signals to ensure those optimizations are done correctly.
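Putting those pieces together, the training loop conceptually looks something like the sketch below. Everything model-related is stubbed out with toy placeholders (there’s no real V3 here, obviously); it’s only meant to show how the group sampling, the rule-based rewards, and the group-relative advantages fit together:

```python
import random
from statistics import mean, stdev

def sample_group(prompt: str, group_size: int = 4) -> list[str]:
    # Placeholder for sampling several answers from the current policy
    # (in DeepSeek's case, a model initialized from V3).
    return [
        f"<think>attempt {i} at: {prompt}</think><answer>{random.choice(['4', '5'])}</answer>"
        for i in range(group_size)
    ]

def reward(output: str, ground_truth: str) -> float:
    # Combined rule-based reward: 1.0 for a correct final answer,
    # plus 0.5 for following the think/answer format.
    correct = f"<answer>{ground_truth}</answer>" in output
    formatted = "<think>" in output and "</think>" in output
    return (1.0 if correct else 0.0) + (0.5 if formatted else 0.0)

def grpo_step(prompt: str, ground_truth: str) -> None:
    outputs = sample_group(prompt)
    rewards = [reward(o, ground_truth) for o in outputs]
    mu, sigma = mean(rewards), stdev(rewards)
    advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
    # A real implementation would now update the policy to make the
    # high-advantage answers more likely and the low-advantage ones less
    # likely (with a clipped, KL-regularized objective, as in the paper).
    print(list(zip(rewards, [round(a, 2) for a in advantages])))

grpo_step("What is 2 + 2?", "4")
```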
Now let’s jump back to Alexandr’s tweet, the second half of it, because this is the real money shot and what I’m going to end my deep dive with, i.e. the really good stuff ⬇️
The major technological breakthrough in DeepSeek R1 is one that is already foundationally changing how companies build LLMs. They proved that for reasoning, you don’t need the most SFT data in the world - what you need is a lot of reasoning data, which at the end of the day is human data, leveraged through GRPO, a new reinforcement learning approach whose rule-based reward system massively amplifies the value of that reasoning data.
SFT caps model performance, while RLHF allows the model to keep improving, a lot like the human brain does. Going back to my Star Trek example: while I can give you a huge labeled dataset with information about Star Trek, that isn’t going to make you a space expert. But if I give you the tools you need to learn about space, and to learn faster than most people, then that changes everything.
So the real breakthrough DeepSeek made is showing how models can improve all on their own, as long as human data is used in creating the model and the reward system inside it. Got all that?
I’m going to end it here but I do always want to leave you with some additional reading in case you really want to geek out.
Breaking down the DeepSeek-R1 training process—no PhD required
Detailed explanation of DeepSeek-R1 method: pure reinforcement learning and self-evolving behavior (warning - very math heavy!)
Thanks for reading and I’ll see you next week!