The simplest way to self-host DeepSeek R1. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek-R1 is a 671B parameter Mixture-of-Experts model that actively uses 37B parameters during inference. Notable for its multi-stage training combining reinforcement learning with "cold-start" Chain-of-Thought examples. Shows strong mathematical ability (79.8% on AIME) and supports 32K token contexts.
DeepSeek-R1 represents a significant advancement in large language model technology, built upon the DeepSeek-V3 architecture. At its core, it's a 671B parameter model that uses a Mixture-of-Experts (MoE) architecture with 37B activated parameters, allowing for efficient processing while maintaining extensive knowledge capacity.
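The "671B total, 37B activated" distinction comes from top-k expert routing: each token is sent to only a few experts, so most parameters sit idle on any given forward pass. The toy sketch below illustrates the idea with a handful of linear "experts"; the sizes, expert count, and gating details are illustrative assumptions, not DeepSeek-V3's actual implementation (which uses many more experts plus shared experts and load-balancing machinery).

```python
# Toy sketch of Mixture-of-Experts top-k routing (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 8   # total experts (DeepSeek-V3 uses far more)
TOP_K = 2       # experts activated per token
D_MODEL = 16    # toy hidden size

# Each "expert" here is just a linear layer.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                    # score every expert
    top = np.argsort(logits)[-TOP_K:]      # keep the k highest-scoring
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only TOP_K of N_EXPERTS actually run: this is why the activated
    # parameter count is a small fraction of the total parameter count.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
out = moe_forward(token)
```

Because only 2 of 8 experts execute per token here, roughly a quarter of the expert parameters are "activated", mirroring (in miniature) the 37B-of-671B ratio.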
The model was developed through an innovative multi-stage training pipeline that combines reinforcement learning (RL) with supervised fine-tuning (SFT). Unlike its predecessor DeepSeek-R1-Zero, which relied solely on RL training, DeepSeek-R1 incorporates "cold-start" data - consisting of thousands of Chain-of-Thought (CoT) examples - before beginning the RL phase. This approach was implemented to address issues observed in DeepSeek-R1-Zero, such as repetitive outputs and poor readability.
The training process involves multiple stages:
1. Cold-start supervised fine-tuning (SFT) on curated Chain-of-Thought examples
2. Reasoning-oriented reinforcement learning
3. Rejection sampling of the RL checkpoint to build new SFT data, combined with supervised data from DeepSeek-V3, followed by another round of fine-tuning
4. A final RL stage covering all scenarios to improve helpfulness and harmlessness
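The pipeline described above can be summarized in rough pseudocode. Every function name below is a placeholder standing in for a stage of the paper's recipe; this is not runnable code or a real API.

```
# Pseudocode sketch of the DeepSeek-R1 training recipe (placeholder names).
def train_deepseek_r1(base_model, cold_start_cot, prompts):
    model = sft(base_model, cold_start_cot)   # Stage 1: cold-start SFT on CoT data
    model = reasoning_rl(model, prompts)      # Stage 2: RL focused on reasoning
    data = rejection_sample(model, prompts) + general_sft_data()
    model = sft(model, data)                  # Stage 3: SFT on curated samples
    model = rl_all_scenarios(model)           # Stage 4: RL for helpfulness/harmlessness
    return model
```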
DeepSeek-R1 demonstrates exceptional performance across various benchmarks, achieving results comparable to OpenAI's o1-1217 model. Notable achievements include:
- 79.8% Pass@1 on AIME 2024
- 97.3% on MATH-500
- A 2,029 Elo rating on Codeforces (96.3rd percentile of human competitors)
- 90.8% on MMLU
The model particularly excels in reasoning tasks, outperforming its base model DeepSeek-V3 across numerous benchmarks including MMLU, MMLU-Pro, and GPQA Diamond. More details about the model's performance can be found in the DeepSeek R1 paper.
The DeepSeek-R1 family includes several variants:
- DeepSeek-R1-Zero: trained purely with reinforcement learning, without a supervised cold start
- DeepSeek-R1: the full multi-stage model described above
- Six distilled dense models based on Qwen-2.5 (1.5B, 7B, 14B, 32B) and Llama (8B, 70B)
The distilled models represent a significant achievement in knowledge transfer, with the 32B Qwen-based version notably outperforming OpenAI's o1-mini across various benchmarks. Research has shown that directly distilling from DeepSeek-R1 produces better results than applying RL directly to smaller models.
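Distillation here is sequence-level: the small models are simply fine-tuned on samples generated by DeepSeek-R1, so the "transfer" reduces to ordinary supervised training on teacher outputs. The toy below makes that concrete with a deliberately tiny "student" (a bigram table) fit on fake teacher traces; the traces and the student are illustrative stand-ins, not DeepSeek's data or models.

```python
# Toy sequence-level distillation: fit a "student" on teacher generations.
from collections import Counter, defaultdict

# Fake teacher-generated reasoning traces (stand-ins for R1 samples).
teacher_traces = [
    "<think> 2+2=4 </think> 4",
    "<think> 3+3=6 </think> 6",
]

# The student is just a token-bigram table: training it on teacher
# outputs is exactly supervised fine-tuning on distilled data.
bigrams = defaultdict(Counter)
for trace in teacher_traces:
    tokens = trace.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def student_next(token: str) -> str:
    """Greedy next-token prediction from the distilled bigram table."""
    return bigrams[token].most_common(1)[0][0]
```

The point of the finding quoted above is that this cheap recipe, with a strong teacher, beats running the expensive RL pipeline directly on the small model.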
The model supports a maximum generation length of 32,768 tokens. For optimal performance, recommended parameters include:
- Temperature of 0.5-0.7 (0.6 recommended) to reduce repetition and incoherence
- Top-p of 0.95
- No system prompt; include all instructions in the user prompt
- For math problems, a directive to reason step by step and put the final answer in \boxed{}
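For a self-hosted deployment, these settings map directly onto an OpenAI-compatible chat request (the interface exposed by common serving stacks such as vLLM or SGLang). The endpoint URL and model id below are placeholders for whatever your server actually exposes; only the sampling values follow DeepSeek's recommendations.

```python
# Hedged example: request payload for a self-hosted DeepSeek-R1 server
# behind an OpenAI-compatible chat endpoint. URL/model id are placeholders.
import json
import urllib.request

payload = {
    "model": "deepseek-r1",  # placeholder model id
    "messages": [
        # Recommendation: no system prompt; put all instructions
        # in the user turn.
        {"role": "user",
         "content": "What is 17 * 24? Reason step by step and put the "
                    "final answer in \\boxed{}."}
    ],
    "temperature": 0.6,   # recommended range 0.5-0.7
    "top_p": 0.95,
    "max_tokens": 32768,  # model's maximum generation length
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment against a live server
```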
The model and its variants are released under the MIT License, permitting commercial use and modification. However, users should note that the distilled models are derived from base models governed by their own terms: Apache 2.0 for Qwen-2.5 and the Llama 3.1/3.3 licenses for the Llama-based variants.