The DeepSeek VL2 family represents a significant advancement in vision-language modeling, introduced by DeepSeek AI in December 2024. The family consists of three models of increasing size and capability: DeepSeek VL2 Tiny with 1.0B activated parameters, DeepSeek VL2 Small with 2.8B activated parameters, and DeepSeek VL2 with 4.5B activated parameters. These models build upon their predecessor, DeepSeek-VL, incorporating architectural improvements and training methodologies detailed in the accompanying research paper.
All models in the DeepSeek VL2 family share a common architectural foundation based on the LLaVA-style design. This architecture comprises three primary components: a SigLIP-SO400M-384 vision encoder, a vision-language adaptor, and a DeepSeekMoE Large Language Model. A distinguishing feature across the family is the implementation of a dynamic tiling vision encoding strategy, which efficiently processes high-resolution images by dividing them into manageable tiles. This approach has proven particularly effective for tasks requiring ultra-high resolution processing, such as visual grounding and document analysis.
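To make the tiling idea concrete, here is a minimal Python sketch of one way such a strategy could work: choose a tile grid whose aspect ratio best matches the input image, resize the image to fit that grid, and cut it into 384×384 crops alongside a downscaled global view. The grid-selection heuristic, the max_tiles limit, and the helper names are illustrative assumptions rather than the repository's actual implementation; only the 384×384 tile size comes from the encoder described above.

```python
from PIL import Image

TILE = 384  # matches the SigLIP-SO400M-384 encoder's input resolution


def dynamic_tile(image: Image.Image, max_tiles: int = 9):
    """Illustrative sketch of dynamic tiling: pick a tile grid whose aspect
    ratio is closest to the input image, resize to that grid, and cut it into
    384x384 tiles. A downscaled global view is kept alongside the local tiles."""
    w, h = image.size

    # Candidate grids (cols x rows) using at most `max_tiles` tiles.
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    # Choose the grid whose aspect ratio best matches the image.
    cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - w / h))

    global_view = image.resize((TILE, TILE))            # coarse full-image view
    resized = image.resize((cols * TILE, rows * TILE))  # fit the chosen grid
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    return global_view, tiles
```

Each tile is then encoded independently by the vision encoder, which is what allows arbitrarily high-resolution inputs without retraining the encoder at larger resolutions.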
The models utilize the Multi-head Latent Attention (MLA) mechanism, which compresses the Key-Value (KV) cache into compact latent vectors, reducing its memory footprint during decoding and thereby enabling more efficient inference and higher throughput. This technical advancement, combined with the Mixture-of-Experts (MoE) architecture, enables the DeepSeek VL2 family to achieve state-of-the-art performance while maintaining computational efficiency. All variants support a maximum sequence length of 4096 tokens, allowing for extensive context processing in multimodal applications.
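The essence of this compression can be sketched in a few lines of PyTorch: rather than caching full per-head keys and values, only a small latent vector per token is cached, and keys and values are re-expanded from it at attention time. The module name and dimensions below are illustrative assumptions, and the real MLA design includes details (such as decoupled rotary position embeddings) that this sketch omits.

```python
import torch
import torch.nn as nn


class LatentKVCache(nn.Module):
    """Minimal sketch of MLA-style KV compression: only a small latent vector
    per token is cached; keys and values are re-expanded from it on the fly."""

    def __init__(self, d_model=2048, n_heads=16, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, hidden, cache=None):
        latent = self.down(hidden)                       # (batch, new_tokens, d_latent)
        cache = latent if cache is None else torch.cat([cache, latent], dim=1)
        b, t, _ = cache.shape
        k = self.up_k(cache).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(cache).view(b, t, self.n_heads, self.d_head)
        # The cache stores d_latent floats per token instead of
        # 2 * n_heads * d_head, which is where the memory saving comes from.
        return k, v, cache
```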
The progression within the DeepSeek VL2 family demonstrates a clear scaling strategy, with each model offering increased capability while retaining the core architectural benefits. DeepSeek VL2 Tiny, the entry-level model, provides an efficient option for organizations with limited computational resources while still delivering competitive performance on standard vision-language tasks. DeepSeek VL2 Small represents a middle ground, offering enhanced capability with its 2.8B activated parameters while remaining deployable on a single 40GB GPU. The flagship DeepSeek VL2 model, with 4.5B activated parameters, delivers the highest performance in the family while maintaining reasonable computational requirements compared to other state-of-the-art models.
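As a rough sanity check on these deployment claims, the arithmetic below estimates bf16 weight memory from approximate total parameter counts (about 3.4B, 16.1B, and 27.5B for Tiny, Small, and the full model, as listed on the model cards). These MoE totals are larger than the activated-parameter counts quoted above, and the estimate ignores activations and the KV cache.

```python
# Back-of-the-envelope bf16 weight memory for the three variants.
# Total parameter counts are approximate figures from the model cards
# (MoE totals, larger than the activated-parameter counts quoted above).
BYTES_PER_PARAM_BF16 = 2

totals = {
    "deepseek-vl2-tiny": 3.4e9,
    "deepseek-vl2-small": 16.1e9,
    "deepseek-vl2": 27.5e9,
}

for name, n_params in totals.items():
    gib = n_params * BYTES_PER_PARAM_BF16 / 2**30
    print(f"{name}: ~{gib:.1f} GiB of bf16 weights")

# deepseek-vl2-small lands around 30 GiB of weights, consistent with fitting
# on a single 40GB GPU with some headroom for activations and the KV cache.
```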
The entire family underwent a comprehensive three-stage training process, representing a significant investment in model development. The first stage focused on vision-language alignment using ShareGPT4V data. The second stage involved extensive vision-language pretraining on a diverse dataset combining interleaved image-text data from sources such as WIT, WikiHow, OBELICS, and Wanjuan, along with specialized data for image captioning, OCR, visual question answering, and visual grounding. The final stage consisted of supervised fine-tuning using various task-specific datasets to enhance performance across different applications.
The DeepSeek VL2 family excels across a wide range of multimodal tasks, with particular strength in visual question answering, optical character recognition, document understanding, and visual grounding. All models in the family support multi-image processing, with the ability to handle up to two images using tiling and three or more images through direct padding to 384x384 resolution. The models have demonstrated exceptional performance across numerous benchmarks, including DocVQA, ChartQA, InfoVQA, TextVQA, OCRBench, AI2D, MMMU, and others, as documented in their technical documentation.
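A hedged sketch of how an application-side preprocessor might mirror that rule is shown below: images are routed to the tiling path when there are at most two of them and are otherwise padded and resized to 384×384. The pad-to-square choice and the helper names are assumptions for illustration, not the models' actual preprocessing code.

```python
from PIL import Image

TILE = 384


def pad_to_square(image: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Pad the image onto a square canvas, then resize it to 384x384."""
    w, h = image.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(image, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((TILE, TILE))


def preprocess(images: list[Image.Image]):
    """Illustrative dispatch: use tiling for one or two images,
    fall back to plain 384x384 padding for three or more."""
    if len(images) <= 2:
        # Dynamic tiling path (see the tiling sketch earlier on this page).
        return [("tiled", img) for img in images]
    return [("padded", pad_to_square(img)) for img in images]
```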
Each model in the family maintains competitive performance relative to its size class, with the larger variants predictably achieving higher benchmark scores. Across these evaluations, the models consistently demonstrate strong performance in document understanding, chart interpretation, and visual question answering. For optimal generation quality, the implementation guidelines recommend a sampling temperature of T ≤ 0.7 during inference for all models in the family.
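As a concrete illustration, the sketch below loads a published checkpoint and sets sampling parameters consistent with that guideline. It assumes the deepseek_vl2 package from the official GitHub repository is installed; the processor class, checkpoint id, and loading path follow that repository's usage examples and should be verified against it, while the top_p and max_new_tokens values are arbitrary illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig

# The deepseek_vl2 package from the official GitHub repository provides the
# chat/image processor; class and checkpoint names here follow its usage
# examples and should be checked against the repository.
from deepseek_vl2.models import DeepseekVLV2Processor

model_id = "deepseek-ai/deepseek-vl2-tiny"  # also available: -small, deepseek-vl2

processor = DeepseekVLV2Processor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# Sampling settings consistent with the T <= 0.7 recommendation; top_p and
# max_new_tokens are illustrative choices, not part of the guideline.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=512,
)

# Multimodal prompting (conversation format, image loading, the generate call)
# follows the official repository's examples and is omitted here.
```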
The DeepSeek VL2 family was initially released on December 13th, 2024, marking a significant milestone in vision-language modeling. On December 25th, 2024, the family received notable updates including a Gradio demo implementation, incremental prefilling capabilities, and VLMEvalKit support, enhancing the models' utility and accessibility for researchers and developers. These updates demonstrate DeepSeek AI's commitment to maintaining and improving the model family's capabilities and user experience.
The DeepSeek VL2 family employs a dual licensing structure that balances open-source accessibility with commercial viability. The code repository is available under the MIT License, while the model weights are governed by the DeepSeek Model License. This licensing structure supports both research and commercial applications, making the models accessible for a wide range of use cases while maintaining appropriate usage guidelines and restrictions. The models and their associated resources are available through the official GitHub repository and Hugging Face.