The simplest way to self-host DeepSeek VL2 Small. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek VL2 Small (2.8B activated parameters) is a vision-language model that uses a Mixture-of-Experts (MoE) design for efficiency. It features dynamic image tiling for high-resolution inputs and Multi-head Latent Attention for fast inference. The model excels at visual QA, OCR, document analysis, and chart interpretation, and supports sequences of up to 4096 tokens.
DeepSeek VL2 Small is a 2.8B activated-parameter vision-language model in the DeepSeek VL2 family, which also includes the smaller DeepSeek VL2-Tiny (1.0B activated parameters) and the larger DeepSeek VL2 (4.5B activated parameters). The model is built on the DeepSeekMoE-16B architecture and combines a Mixture-of-Experts (MoE) language model with Multi-head Latent Attention (MLA) for efficient inference and high throughput.
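To make the MLA idea concrete, the toy sketch below caches one low-rank latent per token and re-expands it into per-head keys and values at attention time. The dimensions, class name, and the omission of MLA's decoupled rotary-embedding path are simplifications for illustration only, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class LatentKVCacheSketch(nn.Module):
    """Toy illustration of the MLA idea: cache a small latent vector per token
    instead of full per-head keys/values. All dimensions are made up."""
    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent to values

    def forward(self, hidden, cache):
        latent = self.down(hidden)        # (batch, seq, d_latent) -- only this is cached
        cache.append(latent)
        all_latent = torch.cat(cache, dim=1)
        k = self.up_k(all_latent)         # keys reconstructed on the fly
        v = self.up_v(all_latent)         # values reconstructed on the fly
        return k, v

sketch = LatentKVCacheSketch()
cache = []
k, v = sketch(torch.randn(1, 4, 1024), cache)   # prefill 4 tokens
k, v = sketch(torch.randn(1, 1, 1024), cache)   # decode 1 token; only 5 small latents are stored
```

Because only the 128-dimensional latents are kept in the cache (rather than full keys and values for every head), memory per cached token shrinks substantially, which is the source of the throughput gains described above.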
The model incorporates two key architectural innovations that set it apart from its predecessor. First, it uses a dynamic tiling vision encoding strategy to handle high-resolution images with varying aspect ratios efficiently. For up to two input images, tiling is applied to keep the token count within the context window; when three or more images are supplied, each is padded to 384x384 and processed without tiling. Every tile is encoded by a shared SigLIP-SO400M-384 vision encoder.
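As a rough illustration of the tiling idea, the sketch below picks a grid of 384x384 tiles for an arbitrary image. The tile budget and the selection criterion are simplified assumptions for illustration; the exact candidate set and selection rule are defined in the DeepSeek-VL2 paper.

```python
TILE = 384        # base tile resolution of the shared SigLIP-SO400M-384 encoder
MAX_TILES = 9     # assumed budget on local tiles (illustrative, not the official limit)

def choose_tile_grid(width: int, height: int) -> tuple[int, int]:
    """Pick a (cols, rows) grid of 384x384 tiles.

    Illustrative heuristic: among grids within the tile budget, prefer the
    aspect ratio closest to the image's, then the largest grid (highest
    effective resolution). The official selection rule is in the paper.
    """
    target = width / height
    candidates = [(m, n) for m in range(1, MAX_TILES + 1)
                         for n in range(1, MAX_TILES + 1) if m * n <= MAX_TILES]
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))

# A wide 1280x720 screenshot maps to a (4, 2) grid here: 8 local tiles,
# to which the actual pipeline also adds a global thumbnail view.
print(choose_tile_grid(1280, 720))
```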
The second key innovation is the DeepSeekMoE language model with Multi-head Latent Attention (MLA). MLA compresses the Key-Value cache, which improves inference speed and throughput. The model processes sequences of up to 4096 tokens and can be deployed on a single GPU with 40GB of memory.
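For deployment on such a GPU, usage typically follows the pattern documented in the official DeepSeek-VL2 repository. The sketch below reproduces that pattern from memory, so the exact class and method names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds, the <|User|>/<|Assistant|> role tags) should be verified against the repository README before use.

```python
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor   # from the official repo; name assumed
from deepseek_vl2.utils.io import load_pil_images        # helper from the repo; name assumed

model_path = "deepseek-ai/deepseek-vl2-small"

# The processor bundles the tokenizer, chat template, and dynamic image tiling.
processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = processor.tokenizer

# bfloat16 weights of the Small model fit on a single 40GB GPU.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

conversation = [
    {"role": "<|User|>", "content": "<image>\nWhat does this chart show?", "images": ["./chart.png"]},
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images,
                   force_batchify=True, system_prompt="").to(model.device)

# Encode image tiles and text into one embedding sequence, then generate.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```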
DeepSeek VL2 Small was trained through a three-stage process (vision-language alignment, vision-language pre-training, and supervised fine-tuning) detailed in the original research paper. The training utilized an improved vision-language dataset spanning image captioning, OCR, visual question answering, and visual grounding data.
The model excels in various multimodal tasks, including visual question answering, optical character recognition (OCR), document and table understanding, chart interpretation, and visual grounding.
DeepSeek VL2 Small demonstrates competitive or state-of-the-art performance across numerous benchmarks, often outperforming similar open-source models with comparable or fewer activated parameters. Notable benchmark performances include strong results on DocVQA, ChartQA, InfoVQA, TextVQA, OCRBench, AI2D, MMMU, MMStar, MathVista, MME, MMBench, MMBench-V1.1, and MMT-Bench.
For optimal performance, it's recommended to use a temperature (T) less than or equal to 0.7 during sampling, as higher temperatures can reduce generation quality. The model also supports incremental prefilling for memory optimization on GPUs with limited capacity.
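In code, that recommendation translates into sampling arguments like the ones below (Hugging Face-style generate() keywords; the top_p value is an illustrative choice, not an official default):

```python
# Sampling settings consistent with the T <= 0.7 recommendation.
gen_kwargs = dict(
    do_sample=True,
    temperature=0.7,     # keep at or below 0.7; higher values tend to degrade quality
    top_p=0.9,           # illustrative nucleus-sampling value, not an official default
    max_new_tokens=512,
)
```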
The DeepSeek VL2 models are released under a dual licensing scheme: the code is available under the MIT License, while the model weights are governed by the DeepSeek Model License. Commercial use is permitted under these licenses.