The simplest way to self-host DeepSeek R1 Distill Qwen 7B. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek R1 Distill Qwen 7B is a 7B parameter model distilled from the much larger DeepSeek R1 while maintaining strong reasoning capabilities. Trained on 800,000 curated samples generated by DeepSeek R1, it builds on the Qwen2.5-Math-7B architecture and performs best with temperature settings of 0.5-0.7.
DeepSeek R1 Distill Qwen 7B represents a significant achievement in model distillation, successfully transferring the reasoning capabilities of the larger DeepSeek R1 model into a more computationally efficient 7B parameter variant. Built upon the Qwen2.5-Math-7B architecture, this model demonstrates how effective distillation techniques can create smaller models that maintain impressive performance across various benchmarks.
The distillation process involved fine-tuning Qwen2.5-Math-7B on approximately 800,000 carefully curated samples generated by the larger DeepSeek-R1 model, which is itself a 671B parameter Mixture-of-Experts model that activates roughly 37B parameters per token. This approach to distillation, as detailed in the DeepSeek R1 paper, demonstrates how reasoning capabilities can be effectively transferred from larger to smaller models.
The parent model, DeepSeek R1, was developed using a sophisticated pipeline incorporating two reinforcement learning (RL) stages and two supervised fine-tuning (SFT) stages. Unlike its sibling model DeepSeek R1-Zero, which was trained with RL alone, DeepSeek R1's training began with "cold-start" data to prevent issues like endless repetition and poor readability.
DeepSeek R1 Distill Qwen 7B has demonstrated strong performance across several key reasoning benchmarks, including math-focused evaluations such as AIME 2024 and MATH-500.
The model significantly outperforms the original 7B Qwen model and shows competitive results against larger models in its class. For optimal performance, the model requires specific sampling settings, most notably a temperature in the 0.5-0.7 range, as shown in the sketch below.
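As a concrete illustration of these settings, the following is a minimal local-inference sketch using Hugging Face Transformers. The model ID and sampling values reflect DeepSeek's published recommendations, but treat the snippet as a starting point rather than an official recipe; exact options and memory requirements depend on your environment.

```python
# Minimal local-inference sketch (assumes the transformers and torch packages,
# a GPU with sufficient VRAM, and the official Hugging Face model ID).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# All instructions go in the user turn; no system prompt is used.
messages = [{
    "role": "user",
    "content": "Solve 2x + 3 = 11. Please reason step by step, "
               "and put your final answer within \\boxed{}.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,  # recommended range is 0.5-0.7
    top_p=0.95,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```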
Within the DeepSeek-R1 family, this 7B model compares favorably to its much larger siblings, DeepSeek-R1-Zero and DeepSeek-R1 (each a 671B parameter MoE model with 37B activated parameters), especially considering its significantly smaller size. The research also suggests that applying an additional RL stage to the distilled model could yield further improvements in performance.
The model can be deployed locally using tools such as vLLM with appropriate temperature and maximum length settings. It's available under the MIT License, though it's important to note that its base model, Qwen2.5-Math-7B, is under the Apache 2.0 License.
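As an illustrative sketch of such a deployment (assuming the vLLM package is installed; the 32,768-token context length and sampling values mirror the recommendations above and should be adjusted to your hardware):

```python
# Illustrative offline-inference sketch with vLLM; the chat template is applied
# via the tokenizer so the prompt matches the model's expected format.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=32768)

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

messages = [{
    "role": "user",
    "content": "How many positive divisors does 360 have? Please reason step "
               "by step, and put your final answer within \\boxed{}.",
}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```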
Implementation best practices include keeping the sampling temperature within the recommended 0.5-0.7 range, placing all instructions in the user prompt rather than a system prompt, asking the model (for math problems) to reason step by step and put the final answer within \boxed{}, and ensuring the response begins with the model's "<think>" reasoning block; a short prompt-construction sketch follows.
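The sketch below only builds the prompt string (no model load). The explicit "<think>" handling is a hypothetical safeguard, since newer revisions of the chat template may already append the tag automatically.

```python
# Prompt-construction sketch: no system message, instructions in the user turn,
# and an explicit "<think>" prefix so the model starts inside its reasoning block.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

messages = [{
    "role": "user",
    "content": "What is the sum of the first 50 positive integers? Please reason "
               "step by step, and put your final answer within \\boxed{}.",
}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Append the opening reasoning tag only if the template has not already done so.
if not prompt.rstrip().endswith("<think>"):
    prompt += "<think>\n"
print(prompt)
```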
For further technical details about the underlying architecture and implementation, developers can refer to the DeepSeek-V3 GitHub repository.