The simplest way to self-host DeepSeek VL2 is to launch a dedicated cloud GPU server running Lab Station OS, which downloads and serves the model for use with any compatible app or framework.
Alternatively, you can download the model weights for local inference. The weights must be used with a compatible app, notebook, or codebase, and may run slowly, or not at all, depending on your system resources, particularly your GPU(s) and available VRAM.
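As a concrete illustration of the download path, here is a minimal sketch using the Hugging Face Hub client; the repository id deepseek-ai/deepseek-vl2-small is an assumption about where the weights are published, and the downloaded checkpoint still has to be loaded by a compatible codebase afterwards.

```python
# Minimal sketch: fetch DeepSeek VL2 weights for local inference.
# The repo id below is an assumption; adjust it for the variant you want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="deepseek-ai/deepseek-vl2-small",   # assumed Hub repository id
    local_dir="./deepseek-vl2-small",           # where to place the checkpoint on disk
)
print(f"Weights downloaded to: {local_dir}")
```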
DeepSeek VL2 is a family of vision-language Mixture-of-Experts (MoE) models (1.0B-4.5B activated parameters) that combines a SigLIP vision encoder with a dynamic tiling strategy for high-resolution images. It excels at visual QA, OCR, and document understanding, pairing a Multi-head Latent Attention mechanism for efficient inference with a three-stage training approach.
DeepSeek VL2 represents a significant advancement in vision-language modeling, introducing a series of large Mixture-of-Experts (MoE) Vision-Language Models that build upon its predecessor, DeepSeek-VL. The model family consists of three variants with varying capabilities and parameter counts, designed to accommodate different computational requirements while maintaining high performance standards.
The architecture follows a LLaVA-style design, incorporating three main components: a SigLIP-SO400M-384 vision encoder, a vision-language adaptor, and a DeepSeekMoE Large Language Model. A key innovation is the implementation of a dynamic tiling vision encoding strategy, which efficiently handles high-resolution images with varying aspect ratios. This approach divides high-resolution images into tiles for processing by the vision encoder, significantly improving performance on tasks requiring ultra-high resolution processing, such as visual grounding and document analysis.
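As a rough illustration of the tiling step (not the exact DeepSeek VL2 algorithm), the sketch below resizes an image to a tile-aligned resolution chosen to roughly preserve its aspect ratio, then cuts it into 384x384 crops matching the SigLIP-SO400M-384 input size; the full pipeline also processes a global thumbnail view and applies its own grid-selection rules, which are not reproduced here.

```python
from PIL import Image

TILE = 384  # matches the SigLIP-SO400M-384 encoder input resolution

def tile_image(img: Image.Image, max_tiles: int = 9) -> list[Image.Image]:
    """Resize to a tile-aligned resolution, then cut into TILE x TILE crops."""
    w, h = img.size
    # Pick a grid close to the original aspect ratio, capped at max_tiles tiles.
    cols = max(1, round(w / TILE))
    rows = max(1, round(h / TILE))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

# Example: a 1600x1200 document page maps to a 3x3 grid of 384x384 tiles with max_tiles=9.
```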
The model utilizes the Multi-head Latent Attention (MLA) mechanism, which compresses the Key-Value cache for more efficient inference and higher throughput. This technical advancement, combined with the MoE architecture, allows DeepSeek VL2 to achieve competitive or state-of-the-art performance while using similar or fewer activated parameters compared to other open-source models, as detailed in the research paper.
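The cache-compression idea behind MLA can be illustrated with a toy sketch: instead of storing full per-head keys and values, each token is projected down to a small shared latent, and keys and values are re-expanded from that latent at attention time. The dimensions below are illustrative assumptions, and details of the real MLA implementation, such as rotary-embedding handling and projection absorption, are omitted.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, HEAD_DIM, D_LATENT = 1024, 8, 128, 128  # illustrative sizes

class LatentKVCache(nn.Module):
    """Toy illustration: cache one small latent per token instead of full K/V."""

    def __init__(self) -> None:
        super().__init__()
        self.down = nn.Linear(D_MODEL, D_LATENT, bias=False)             # compress the token
        self.up_k = nn.Linear(D_LATENT, N_HEADS * HEAD_DIM, bias=False)  # rebuild keys
        self.up_v = nn.Linear(D_LATENT, N_HEADS * HEAD_DIM, bias=False)  # rebuild values

    def forward(self, hidden: torch.Tensor, cache: torch.Tensor | None):
        latent = self.down(hidden)  # [batch, new_tokens, D_LATENT]
        cache = latent if cache is None else torch.cat([cache, latent], dim=1)
        k = self.up_k(cache)        # keys/values are reconstructed on the fly
        v = self.up_v(cache)
        return k, v, cache

layer = LatentKVCache()
k, v, cache = layer(torch.randn(1, 1, D_MODEL), cache=None)
# Cache growth per token: D_LATENT floats vs. 2 * N_HEADS * HEAD_DIM = 2048 floats
# for a conventional per-head key/value cache at these illustrative sizes.
```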
The DeepSeek VL2 family includes three distinct variants: DeepSeek-VL2-Tiny (1.0B activated parameters), DeepSeek-VL2-Small (2.8B activated parameters), and DeepSeek-VL2 (4.5B activated parameters).
All variants support a 4096-token sequence length and excel across a range of tasks, including visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.
The training process involved three comprehensive stages:
1. Vision-language alignment using ShareGPT4V
2. Vision-language pretraining on a diverse dataset that mixes vision-language and text-only data
3. Supervised fine-tuning on datasets covering tasks from general VQA to visual grounding
The model demonstrates exceptional performance across multiple benchmarks, including DocVQA, ChartQA, InfoVQA, TextVQA, OCRBench, AI2D, MMMU, MMStar, MathVista, MME, MMBench, MMBench-V1.1, MMT-Bench, and RealWorldQA. For optimal generation quality, a temperature setting of T ≤ 0.7 is recommended.
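For concreteness, a minimal sketch of applying this temperature cap with a Hugging Face-style GenerationConfig is shown below; the model and processor loading code is omitted because it depends on the serving stack (for example, the official deepseek_vl2 package or another compatible runtime), and max_new_tokens is an arbitrary illustrative value.

```python
# Hedged sketch: enforce the recommended sampling temperature (T <= 0.7)
# for a Hugging Face-style generate() call on an already-loaded model.
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,       # recommended upper bound for generation quality
    max_new_tokens=512,    # illustrative value, not a DeepSeek VL2 requirement
)

# outputs = model.generate(**inputs, generation_config=gen_config)
```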
The DeepSeek VL2 ecosystem is distributed under a dual licensing structure: the code repository is released under the MIT License, while the model weights are governed by the DeepSeek Model License.
Commercial use is supported under these licensing terms. The models were initially released on December 13th, 2024, with additional features including a Gradio demo, incremental prefilling, and VLMEvalKit support added on December 25th, 2024.