The DeepSeek R1 model family, introduced by DeepSeek AI in January 2025, represents a significant advancement in large language model technology through its innovative approach to model distillation and reasoning capabilities. At the heart of the family is DeepSeek R1, a 671B-parameter model using a Mixture-of-Experts (MoE) architecture in which roughly 37B parameters are activated per token. This flagship model serves as the foundation for a series of carefully crafted distilled variants ranging from 1.5B to 70B parameters.
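The efficiency of MoE comes from routing each token to only a few experts, so the compute per token is a small fraction of the total parameter count (here, about 37B of 671B, or roughly 5.5%). As a rough illustration of the routing idea only (not DeepSeek's actual router, which is a learned network over many experts), here is a minimal top-k gating sketch in plain Python:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, router_scores, k=2):
    """Route input x to the top-k experts by gate value and combine
    their outputs, weighted by the renormalized gates. Only k of the
    len(experts) experts are ever evaluated."""
    gates = softmax(router_scores)
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    return sum(gates[i] / norm * experts[i](x) for i in top)

# Toy scalar "experts" standing in for feed-forward blocks.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 1.5]  # hypothetical router logits for one token
y = moe_forward(3.0, experts, scores, k=2)  # only 2 of the 4 experts run
```

Because only k experts execute per token, a model can hold a very large total parameter count while paying the compute cost of a much smaller dense model, which is how 671B total parameters translate into roughly 37B active per token.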
The family's development began with DeepSeek R1 Zero, which relied solely on reinforcement learning for training. While this initial model showed promising reasoning capabilities, it suffered from issues such as repetitive outputs and poor readability. Learning from these limitations, researchers developed an improved training methodology for the main DeepSeek R1 model, incorporating a "cold-start" phase with thousands of Chain-of-Thought (CoT) examples before beginning the reinforcement learning stages, as detailed in the DeepSeek R1 paper.
This enhanced training pipeline, combining two stages of reinforcement learning (RL) with two stages of supervised fine-tuning (SFT), became the foundation for developing the entire model family. The success of this approach led to the creation of six distilled variants, built upon either the Qwen or Llama architectures, each optimized for different use cases and computational constraints.
The DeepSeek R1 family includes six distilled variants, each built upon a different base architecture:

The Qwen-based models include:
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepSeek-R1-Distill-Qwen-7B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B

The Llama-based models include:
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Llama-70B
The DeepSeek R1 family demonstrates exceptional performance across various benchmarks, with the flagship model achieving results comparable to OpenAI's o1-1217 model. Notable achievements include 79.8% Pass@1 on AIME 2024, 97.3% on MATH-500, and a Codeforces rating exceeding that of 96.3% of human participants.
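The Pass@1 figures above are instances of the pass@k metric common in reasoning and code benchmarks. A standard way to compute it is the unbiased estimator popularized by code-generation evaluations: sample n completions, count the c correct ones, and estimate the probability that at least one of k drawn samples passes. Whether DeepSeek's evaluation uses exactly this estimator is an assumption; the sketch below shows the metric itself:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n generations (c of them
    correct) passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 16 samples of which 12 pass, pass@1 is just the fraction correct:
p1 = pass_at_k(16, 12, 1)  # 0.75
p4 = pass_at_k(16, 12, 4)  # approaches 1.0 as k grows
```

For k = 1 the estimator reduces to the mean pass rate over the sampled generations, which is why pass@1 is often averaged over many samples rather than computed from a single greedy decode.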
The distilled variants maintain impressive capabilities relative to their size, with the 32B Qwen variant notably outperforming OpenAI's o1-mini across multiple benchmarks. Research has shown that directly distilling from DeepSeek R1 produces better results than applying RL directly to smaller models, as documented in the technical documentation.
All models in the family share several key technical characteristics: the weights are distributed in the safetensors format with BF16 parameters, and the models are designed to work efficiently with modern deep learning frameworks. The architecture emphasizes both computational efficiency and reasoning capability, making the models particularly well-suited for mathematical and scientific applications.
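BF16 (bfloat16) keeps float32's 8-bit exponent but truncates the mantissa to 7 bits, halving storage while preserving float32's dynamic range. A minimal sketch of the conversion using only the standard library (simple truncation; real frameworks typically round to nearest even):

```python
import struct

def float32_to_bf16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by truncating the
    float32 mantissa (keep sign, 8-bit exponent, top 7 mantissa bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def bf16_bits_to_float(b):
    """Widen a bfloat16 bit pattern back to a Python float by padding
    the low 16 bits with zeros."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

x = 3.1415926
approx = bf16_bits_to_float(float32_to_bf16_bits(x))  # ~3.14, about 3 significant digits
```

The roughly three significant decimal digits that survive are enough for neural-network weights in practice, which is why BF16 is a common storage and training format for models of this scale.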
The DeepSeek R1 family operates under a multi-tiered licensing structure, with the code released under the MIT License and model weights subject to specific agreements. The base models used for distillation carry their respective licenses: Apache 2.0 for the Qwen-based models and Llama-specific licenses for the Llama-based variants, as detailed in the official licensing documentation.
The DeepSeek R1 family represents a significant advancement in efficient AI model design, demonstrating that smaller, distilled models can maintain much of the reasoning capability of their larger counterparts. This breakthrough has important implications for the deployment of AI systems in resource-constrained environments and sets a new standard for model distillation techniques.
The success of the family's training methodology, particularly the combination of cold-start data with reinforcement learning, suggests promising directions for future model development. Ongoing research continues to explore the potential for further improvements in both the distillation process and the base architecture.
Through its comprehensive range of models and innovative training approach, the DeepSeek R1 family has established itself as a significant contribution to the field of large language models, offering options for a variety of computational budgets while maintaining high standards of performance and reliability.