The DeepSeek V2 model family represents a significant advance in large language model technology, introduced by DeepSeek-AI in 2024. The family consists of three main branches: the general-purpose DeepSeek V2, its refined successor DeepSeek V2.5, and the specialized DeepSeek Coder V2 variants. All models in the family share a common architectural foundation based on a Mixture-of-Experts (MoE) design, which enables efficient parameter utilization while maintaining powerful capabilities.
The DeepSeek V2 family introduces several architectural innovations that define its capabilities. The models employ Multi-head Latent Attention (MLA), which compresses the Key-Value (KV) cache into a compact latent vector for efficient inference. According to the technical report, this reduces KV cache size by 93.3% compared with the previous-generation DeepSeek 67B model.
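To make the idea concrete, the following is a minimal, illustrative sketch of the low-rank KV compression at the heart of MLA: hidden states are projected down to a small latent vector, which is the only tensor that needs to be cached, and per-head keys and values are reconstructed from it on the fly. The class name and all dimensions are placeholders rather than the official configuration, and details such as the decoupled rotary position embeddings used in the real architecture are omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of the low-rank KV compression idea behind Multi-head
# Latent Attention (MLA). Dimensions and names are illustrative, not the
# official DeepSeek V2 configuration; RoPE handling is omitted.
class LatentKVCompression(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        # Down-projection: hidden state -> compact latent vector (this is what gets cached)
        self.down_proj = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: latent vector -> per-head keys and values
        self.k_up_proj = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.v_up_proj = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, hidden_states):
        # Only the latent vector needs to live in the KV cache, which is
        # what shrinks cache size relative to caching full keys and values.
        latent = self.down_proj(hidden_states)                 # (batch, seq, d_latent)
        b, s, _ = latent.shape
        k = self.k_up_proj(latent).view(b, s, self.n_heads, self.d_head)
        v = self.v_up_proj(latent).view(b, s, self.n_heads, self.d_head)
        return latent, k, v

x = torch.randn(1, 8, 4096)
latent, k, v = LatentKVCompression()(x)
print(latent.shape, k.shape)   # torch.Size([1, 8, 512]) torch.Size([1, 8, 32, 128])
```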
The family's distinctive feature is the DeepSeekMoE architecture, which implements fine-grained expert segmentation and shared expert isolation. This design allows the models to achieve strong performance while significantly reducing computational costs. For instance, the full-size models contain 236B total parameters but activate only 21B per token, while the Lite variants contain 15.7B to 16B total parameters with only 2.4B activated per token.
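As a rough illustration of that design, here is a small, self-contained sketch of a DeepSeekMoE-style layer: a pool of fine-grained routed experts selected per token by a top-k gate, plus a few shared experts that every token passes through. The class name, expert counts, and hidden sizes are invented for the example and are far smaller than the real configuration; load-balancing losses and other training details are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of a DeepSeekMoE-style layer: many small "routed"
# experts chosen by a top-k gate, plus shared experts used by every token.
# Sizes and counts here are placeholders, not the V2 configuration.
class MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=512, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)      # top-k routed experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        for expert in self.shared:                          # shared experts see every token
            out += expert(x)
        return out

tokens = torch.randn(10, 1024)
print(MoELayer()(tokens).shape)   # torch.Size([10, 1024])
```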
The family began with the release of DeepSeek V2 in early 2024, which established the foundation for the subsequent models. This was quickly followed by DeepSeek V2.5, which introduced refinements to the architecture while maintaining the same parameter count. Both models underwent extensive pretraining on 8.1 trillion tokens, with a notable emphasis on Chinese language content.
The specialized DeepSeek Coder V2 branch emerged from this foundation, with additional training on 6 trillion tokens focused on programming and mathematical content. This variant introduced support for 338 programming languages, a significant expansion from earlier versions. The DeepSeek Coder V2 Lite offers a more resource-efficient alternative while maintaining strong performance in coding tasks.
The DeepSeek V2 family performs well across a wide range of benchmarks. The general-purpose models excel in both English and Chinese language tasks, showing strong results on evaluations such as MMLU, BBH, C-Eval, and CMMLU. The models support a context length of 128K tokens, extended using the YaRN method, enabling them to handle complex, long-form content effectively.
The Coder variants show particular strength in programming and mathematical reasoning tasks. According to the research paper, DeepSeek Coder V2 achieves state-of-the-art results among open-source models in code generation and demonstrates comparable performance to leading closed-source models in mathematical reasoning tasks.
Each model in the family undergoes a sophisticated training process that includes pretraining, supervised fine-tuning (SFT), and reinforcement learning optimization. The general-purpose models utilize 1.5M conversational sessions for SFT, while the Coder variants incorporate additional specialized training data focused on programming and mathematical content.
The models employ various training techniques, including Fill-In-the-Middle (FIM) training, cosine learning rate scheduling, and Group Relative Policy Optimization (GRPO). The Coder variants specifically incorporate compiler feedback and test cases during alignment to enhance their programming capabilities.
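To illustrate the core of GRPO, the sketch below shows the group-relative advantage computation it is built around: several responses are sampled per prompt, each is scored (for the Coder variants, for example, by whether generated code compiles and passes tests), and rewards are normalized within the group rather than against a separately learned value model. The function name and reward values are illustrative, and the full GRPO objective with its clipped policy ratio and KL penalty is not shown.

```python
import torch

# Sketch of the group-relative advantage at the core of GRPO: sample a
# group of responses per prompt, score each, and normalize rewards within
# the group instead of training a separate value model. Illustrative only.
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (n_prompts, group_size) scalar rewards, one per sampled response."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each (e.g. 1.0 = code compiled
# and passed its tests, 0.0 = it did not).
rewards = torch.tensor([[1.0, 0.0, 1.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```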
The full-size models in the family require substantial computational resources, typically needing eight GPUs with 80GB of memory each for inference in BF16 format. The Lite variants offer more accessible alternatives, making the family's capabilities available to users with more modest computational resources.
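A quick back-of-the-envelope calculation shows why hardware on that scale is needed just to hold the weights, assuming 2 bytes per parameter in BF16; real deployments also need headroom for activations and the KV cache.

```python
# Rough BF16 memory footprint of the full-size 236B-parameter models
# (weights only; KV cache and activations add further overhead).
params = 236e9
bytes_per_param_bf16 = 2
weight_gib = params * bytes_per_param_bf16 / 1024**3
print(f"~{weight_gib:.0f} GiB of weights")                  # ~440 GiB
print(f"per GPU across 8 GPUs: ~{weight_gib / 8:.0f} GiB")  # ~55 GiB each
```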
All models in the family are available through the Hugging Face Transformers library and support deployment via vLLM. The code is released under the MIT License, while the models themselves are subject to a separate license that permits commercial use, as detailed in the model license agreement.
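As a hedged example of the Transformers path, the snippet below follows the loading pattern shown in the official model cards; the exact model ID, chat template usage, and generation settings should be checked against the card for the specific variant. DeepSeek-V2-Lite-Chat is used here purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal Transformers loading sketch for a DeepSeek V2 family model.
# Mirrors the pattern in the official model cards; confirm the recommended
# settings for the variant you deploy.
model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # the V2 architecture ships custom modeling code
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a quicksort function in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```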
The DeepSeek V2 family represents a significant advancement in efficient, high-performance language models. Their innovative architecture has demonstrated that it's possible to achieve state-of-the-art results while maintaining computational efficiency through clever parameter activation strategies. The family's success in both general-purpose and specialized applications suggests a promising direction for future language model development, particularly in the realm of expert-based architectures and efficient parameter utilization.