The simplest way to self-host DeepSeek V2. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek-V2 is a 236B parameter MoE model (21B active per token) trained on 8.1T tokens, featuring Multi-head Latent Attention for efficient inference. Notable for Chinese/English bilingual capabilities, code generation, and math tasks. Includes a smaller 15.7B Lite variant. Training combined SFT and two-stage RL.
DeepSeek-V2 represents a significant advancement in large language model architecture, combining efficiency with powerful capabilities through its innovative Mixture-of-Experts (MoE) design. With 236B total parameters but only 21B activated per token, the model achieves remarkable performance while maintaining computational efficiency, as detailed in the technical report.
DeepSeek-V2's architecture introduces several key innovations that set it apart from traditional dense language models. The model employs Multi-head Latent Attention (MLA) for its attention mechanism, which uses low-rank joint compression of keys and values to shrink the inference-time key-value (KV) cache. This architectural choice contributes to a 93.3% reduction in KV cache size compared to its predecessor.
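To make the compression idea concrete, here is a minimal sketch of low-rank joint key-value compression, the core mechanism MLA builds on. The module name and dimensions are illustrative placeholders rather than DeepSeek-V2's actual configuration, and MLA-specific details such as the decoupled rotary position embedding are omitted.

```python
# Minimal sketch of low-rank joint key-value compression (the idea behind MLA).
# Dimensions are illustrative, not DeepSeek-V2's actual configuration.
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    def __init__(self, d_model=5120, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        # Down-project the hidden state into one small shared latent vector...
        self.down_proj = nn.Linear(d_model, d_latent, bias=False)
        # ...and reconstruct per-head keys and values from it at attention time.
        self.up_proj_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_proj_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def forward(self, hidden_states):
        # Only kv_latent needs to be cached during generation, instead of the
        # full per-head keys and values, which is where the cache savings come from.
        kv_latent = self.down_proj(hidden_states)   # [batch, seq, d_latent]
        keys = self.up_proj_k(kv_latent)            # [batch, seq, n_heads * d_head]
        values = self.up_proj_v(kv_latent)          # [batch, seq, n_heads * d_head]
        return kv_latent, keys, values
```

Because only the small latent tensor is cached per token, the cache footprint scales with d_latent rather than with the full per-head key/value width.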
The model utilizes the DeepSeekMoE architecture for Feed-Forward Networks (FFNs), a high-performance MoE design that enables sparse computation for cost-effective training. This architecture allows DeepSeek-V2 to achieve stronger performance while reducing training costs by 42.5% and increasing maximum generation throughput by 5.76 times compared to the 67B parameter DeepSeek model.
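The sketch below illustrates the general pattern of a sparse MoE feed-forward layer with top-k routing, the mechanism DeepSeekMoE builds on. Expert counts, sizes, and the routing loop are illustrative only; DeepSeekMoE-specific features such as fine-grained expert segmentation and shared experts are not shown.

```python
# Minimal sketch of sparse top-k expert routing in an MoE FFN layer.
# All sizes are illustrative; this is not DeepSeekMoE's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    def __init__(self, d_model=1024, d_ffn=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)         # routing probabilities
        weights, indices = scores.topk(self.top_k, dim=-1) # keep only the top-k experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts (sparse computation).
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only a small fraction of the total parameters participates in each token's forward pass, which is how a 236B-parameter model can activate just 21B parameters per token.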
DeepSeek-V2 underwent extensive pretraining on a diverse corpus of 8.1 trillion tokens, with a notably higher proportion of Chinese content compared to its predecessor. The training process included both Supervised Fine-Tuning (SFT) and a two-stage Reinforcement Learning (RL) approach. The RL training focused first on reasoning alignment using code and math data, followed by human preference alignment utilizing multiple reward models.
The model demonstrates strong capabilities across context window lengths up to 128K tokens, as evidenced by its performance on the Needle In A Haystack (NIAH) tests.
DeepSeek-V2 demonstrates strong performance across a wide range of benchmarks. On standard evaluations like MMLU, BBH, C-Eval, CMMLU, HumanEval, MBPP, and GSM8K, it achieves competitive scores compared to models like LLaMA3 70B and Mixtral 8x22B, with particular excellence in Chinese language tasks.
The RL-fine-tuned chat model variant shows particularly strong performance in open-ended English conversation, as demonstrated by its AlpacaEval 2.0 and MT-Bench results.
The model family also includes DeepSeek-V2-Lite, a smaller 15.7B parameter variant with 2.4B activated parameters per token, offering a more resource-efficient option while maintaining strong performance characteristics.
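For a rough sense of how the Lite variant can be run on modest hardware, here is a hedged Hugging Face Transformers sketch. The repository ID deepseek-ai/DeepSeek-V2-Lite and the trust_remote_code requirement reflect the published Hugging Face release, but exact memory needs and generation settings depend on your hardware and installed library versions.

```python
# Sketch: loading DeepSeek-V2-Lite with Hugging Face Transformers.
# Repository ID, dtype, and generation settings are illustrative; adjust to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 keeps memory use manageable for the 15.7B Lite model
    device_map="auto",            # place weights on the available GPU(s)
    trust_remote_code=True,       # the model ships custom modeling code
)

inputs = tokenizer("Write a short poem about mixture-of-experts models.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```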
For local inference in BF16, DeepSeek-V2 requires eight GPUs with 80GB of memory each. The model supports both the Hugging Face Transformers library and vLLM for deployment. The codebase is available under the MIT License, while the model itself (both base and chat variants) is subject to a separate Model License that permits commercial use.
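For serving the full model across multiple GPUs, a vLLM deployment along the lines below is one option. This is a sketch assuming a vLLM build with DeepSeek-V2 support; the repository ID, tensor-parallel degree, context length, and sampling settings are illustrative rather than prescriptive.

```python
# Sketch: serving DeepSeek-V2-Chat with vLLM across 8 GPUs via tensor parallelism.
# Assumes a vLLM version with DeepSeek-V2 support; parameters are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",
    tensor_parallel_size=8,       # shard the 236B model across 8 GPUs
    trust_remote_code=True,
    max_model_len=8192,           # raise toward 128K only if GPU memory allows
)

sampling = SamplingParams(temperature=0.3, max_tokens=256)
outputs = llm.generate(["Explain Multi-head Latent Attention in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```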