The simplest way to self-host DeepSeek V2.5. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
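If you take the local route, a minimal sketch of fetching the weights with the huggingface_hub client is shown below; the repository name matches the published deepseek-ai/DeepSeek-V2.5 checkpoint, while the local directory is just an illustrative choice.

```python
# Minimal sketch: fetch the DeepSeek-V2.5 weights for local inference.
# Assumes huggingface_hub is installed; the local path is illustrative.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V2.5",
    local_dir="./deepseek-v2.5",  # where the sharded safetensors will land
)
print(f"Weights downloaded to {local_dir}")
```

Note that the full BF16 checkpoint weighs in at several hundred gigabytes, so plan disk space accordingly and serve it with a framework that supports the architecture (for example vLLM, or Transformers with remote code enabled).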
DeepSeek V2.5 is a 236B parameter MoE model that activates only 21B parameters per token. It features Multi-head Latent Attention for efficient KV caching and supports 128K context. Trained on 8.1T tokens, it excels at coding and math tasks while offering strong Chinese language capabilities.
DeepSeek-V2.5 represents a significant advance in large language model design, combining an efficient architecture with strong capabilities. This 236B parameter Mixture-of-Experts (MoE) model reaches top-tier performance among open models while remaining computationally efficient, thanks to the architectural choices detailed in the technical report.
The model's architecture incorporates two key innovations: Multi-head Latent Attention (MLA) and DeepSeekMoE. While the total parameter count is 236B, only 21B parameters are activated per token, significantly reducing computational requirements. The MLA mechanism compresses the Key-Value (KV) cache into a latent vector, enabling more efficient inference.
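To make the caching idea concrete, here is a toy sketch, not DeepSeek's actual implementation, of latent KV compression: the layer caches one small latent vector per token and reconstructs full keys and values from it at attention time, so the cache grows with the latent dimension rather than with heads times head dimension. All sizes below are illustrative.

```python
# Toy illustration of latent KV compression in the spirit of MLA.
# Dimensions are illustrative and far smaller than the real model.
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64

class LatentKVAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(d_model, n_heads * d_head)
        # Down-project hidden states to a small latent; only this is cached.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-project the cached latent back to full keys/values when attending.
        self.k_up = nn.Linear(d_latent, n_heads * d_head)
        self.v_up = nn.Linear(d_latent, n_heads * d_head)
        self.out = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x, latent_cache):
        b, t, _ = x.shape
        # Cache grows by d_latent per token instead of 2 * n_heads * d_head.
        latent_cache = torch.cat([latent_cache, self.kv_down(x)], dim=1)
        s = latent_cache.shape[1]
        q = self.q_proj(x).view(b, t, n_heads, d_head).transpose(1, 2)
        k = self.k_up(latent_cache).view(b, s, n_heads, d_head).transpose(1, 2)
        v = self.v_up(latent_cache).view(b, s, n_heads, d_head).transpose(1, 2)
        # Plain scaled dot-product attention; causal masking omitted for brevity.
        attn = torch.softmax(q @ k.transpose(-2, -1) / d_head**0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, n_heads * d_head)
        return self.out(y), latent_cache

layer = LatentKVAttention()
cache = torch.zeros(1, 0, d_latent)                     # empty cache
out, cache = layer(torch.randn(1, 5, d_model), cache)   # 5 tokens -> 5 x 64 cached values
```

DeepSeek's real MLA involves further details, such as query compression and a decoupled rotary-position component kept alongside the latent, but the sketch conveys why the cached state per token is much smaller than a full multi-head KV cache.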
Compared to its predecessor, DeepSeek 67B, the DeepSeek-V2 architecture underlying V2.5 delivers substantial efficiency gains: the technical report cites roughly 42.5% lower training costs, a 93.3% smaller KV cache, and up to 5.76x higher maximum generation throughput.
DeepSeek-V2.5 was trained on a high-quality, multi-source corpus of 8.1 trillion tokens, with a larger share of Chinese content than previous versions. Pretraining was followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to align the model's outputs with human preferences.
The model supports a context length of 128K tokens and maintains strong retrieval accuracy across the full window, as demonstrated by Needle in a Haystack (NIAH) evaluations.
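As a rough illustration of how such a retrieval check works, the sketch below assembles a synthetic needle-in-a-haystack prompt; the filler sentence, needle string, tokens-per-sentence estimate, and insertion depth are all arbitrary assumptions, and the resulting prompt would be sent to the model through whatever serving frontend you use.

```python
# Sketch of a needle-in-a-haystack style probe for a long-context model.
# Filler text, needle, and depth are arbitrary; send `prompt` to the served
# model and check whether the answer contains the needle.

def build_niah_prompt(context_tokens: int = 4000, depth: float = 0.5) -> str:
    filler = "The sky was clear and the grass was green. "   # repeated haystack text
    needle = "The secret passphrase is 'blue-harbor-42'. "   # fact to retrieve
    # Rough budget: assume ~8 tokens per filler sentence (illustrative only).
    n_sentences = context_tokens // 8
    insert_at = int(n_sentences * depth)
    haystack = [filler] * n_sentences
    haystack.insert(insert_at, needle)
    question = "\n\nWhat is the secret passphrase mentioned above?"
    return "".join(haystack) + question

prompt = build_niah_prompt(context_tokens=4000, depth=0.25)
print(prompt[:200], "...")
```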
DeepSeek-V2.5 achieves strong results across a wide range of benchmarks, particularly on coding and mathematical tasks, outperforming many larger dense models while activating far fewer parameters per token. Detailed comparisons against other leading models are provided in the technical report.
The model family also includes DeepSeek-V2-Lite, a smaller 15.7B parameter variant with 2.4B activated parameters, intended primarily for research use. The code is released under the MIT License, while the model weights are covered by a separate model license that permits commercial use.
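The Lite variant is the practical choice for single-GPU experimentation; a rough sketch of loading it with Hugging Face Transformers (assuming the deepseek-ai/DeepSeek-V2-Lite repository and sufficient VRAM) follows.

```python
# Rough sketch: run the smaller DeepSeek-V2-Lite variant locally with Transformers.
# trust_remote_code is required because the architecture ships custom modeling code;
# device_map="auto" additionally assumes the accelerate package is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer(
    "Write a Python function that reverses a string.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```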