The simplest way to self-host DeepSeek VL2 Small. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek VL2 Small (2.8B activated parameters) is a vision-language model that uses a Mixture-of-Experts (MoE) design for efficiency. It features dynamic image tiling for high-resolution inputs and Multi-head Latent Attention for fast inference. The model excels at visual QA, OCR, document analysis, and chart interpretation, and supports sequences of up to 4096 tokens.
DeepSeek VL2 Small is a 2.8B activated-parameter vision-language model in the DeepSeek VL2 family, which also includes the smaller DeepSeek VL2-Tiny (1.0B activated parameters) and the larger DeepSeek VL2 (4.5B activated parameters). The model is built on the DeepSeekMoE-16B architecture and combines a Mixture-of-Experts (MoE) language model with Multi-head Latent Attention (MLA) for efficient inference and high throughput.
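To make the MLA idea concrete, the toy sketch below caches one low-rank latent per token and re-expands it into per-head keys and values at attention time. The dimensions, class name, and the omission of MLA's decoupled rotary-embedding path are simplifications for illustration only, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class LatentKVCacheSketch(nn.Module):
    """Toy illustration of the MLA idea: cache a small latent vector per token
    instead of full per-head keys/values. All dimensions are made up."""
    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent to values

    def forward(self, hidden, cache):
        latent = self.down(hidden)        # (batch, seq, d_latent) -- only this is cached
        cache.append(latent)
        all_latent = torch.cat(cache, dim=1)
        k = self.up_k(all_latent)         # keys reconstructed on the fly
        v = self.up_v(all_latent)         # values reconstructed on the fly
        return k, v

sketch = LatentKVCacheSketch()
cache = []
k, v = sketch(torch.randn(1, 4, 1024), cache)   # prefill 4 tokens
k, v = sketch(torch.randn(1, 1, 1024), cache)   # decode 1 token; only 5 small latents are stored
```

Because only the 128-dimensional latents are kept in the cache (rather than full keys and values for every head), memory per cached token shrinks substantially, which is the source of the throughput gains described above.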
The model incorporates two key architectural innovations that set it apart from its predecessor. First, it uses a dynamic tiling vision encoding strategy to handle high-resolution images with varying aspect ratios efficiently. For up to two input images, tiling is applied to keep the token count within the context window; when three or more images are supplied, each is padded to 384x384 and processed without tiling. Every tile is encoded by a shared SigLIP-SO400M-384 vision encoder.
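As a rough illustration of the tiling idea, the sketch below picks a grid of 384x384 tiles for an arbitrary image. The tile budget and the selection criterion are simplified assumptions for illustration; the exact candidate set and selection rule are defined in the DeepSeek-VL2 paper.

```python
TILE = 384        # base tile resolution of the shared SigLIP-SO400M-384 encoder
MAX_TILES = 9     # assumed budget on local tiles (illustrative, not the official limit)

def choose_tile_grid(width: int, height: int) -> tuple[int, int]:
    """Pick a (cols, rows) grid of 384x384 tiles.

    Illustrative heuristic: among grids within the tile budget, prefer the
    aspect ratio closest to the image's, then the largest grid (highest
    effective resolution). The official selection rule is in the paper.
    """
    target = width / height
    candidates = [(m, n) for m in range(1, MAX_TILES + 1)
                         for n in range(1, MAX_TILES + 1) if m * n <= MAX_TILES]
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))

# A wide 1280x720 screenshot maps to a (4, 2) grid here: 8 local tiles,
# to which the actual pipeline also adds a global thumbnail view.
print(choose_tile_grid(1280, 720))
```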
The second key innovation is the DeepSeekMoE language model with Multi-head Latent Attention (MLA). MLA compresses the Key-Value cache, which improves inference speed and throughput. The model processes sequences of up to 4096 tokens and can be deployed on a single GPU with 40GB of memory.
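For deployment on such a GPU, usage typically follows the pattern documented in the official DeepSeek-VL2 repository. The sketch below reproduces that pattern from memory, so the exact class and method names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds, the <|User|>/<|Assistant|> role tags) should be verified against the repository README before use.

```python
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor   # from the official repo; name assumed
from deepseek_vl2.utils.io import load_pil_images        # helper from the repo; name assumed

model_path = "deepseek-ai/deepseek-vl2-small"

# The processor bundles the tokenizer, chat template, and dynamic image tiling.
processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = processor.tokenizer

# bfloat16 weights of the Small model fit on a single 40GB GPU.
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

conversation = [
    {"role": "<|User|>", "content": "<image>\nWhat does this chart show?", "images": ["./chart.png"]},
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images,
                   force_batchify=True, system_prompt="").to(model.device)

# Encode image tiles and text into one embedding sequence, then generate.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```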
DeepSeek VL2 Small was trained through a three-stage process (vision-language alignment, vision-language pre-training, and supervised fine-tuning) detailed in the original research paper. The training utilized an improved vision-language dataset spanning image captioning, OCR, visual question answering, and visual grounding data.
The model excels in various multimodal tasks, including visual question answering, optical character recognition (OCR), document and table understanding, chart interpretation, and visual grounding.
DeepSeek VL2 Small demonstrates competitive or state-of-the-art performance across numerous benchmarks, often outperforming similar open-source models with comparable or fewer activated parameters. Notable benchmark performances include strong results on DocVQA, ChartQA, InfoVQA, TextVQA, OCRBench, AI2D, MMMU, MMStar, MathVista, MME, MMBench, MMBench-V1.1, and MMT-Bench.
For optimal performance, it's recommended to use a temperature (T) less than or equal to 0.7 during sampling, as higher temperatures can reduce generation quality. The model also supports incremental prefilling for memory optimization on GPUs with limited capacity.
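In code, that recommendation translates into sampling arguments like the ones below (Hugging Face-style generate() keywords; the top_p value is an illustrative choice, not an official default):

```python
# Sampling settings consistent with the T <= 0.7 recommendation.
gen_kwargs = dict(
    do_sample=True,
    temperature=0.7,     # keep at or below 0.7; higher values tend to degrade quality
    top_p=0.9,           # illustrative nucleus-sampling value, not an official default
    max_new_tokens=512,
)
```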
The DeepSeek VL2 models are released under a dual licensing scheme: the code is available under the MIT License, while the model weights are governed by the DeepSeek Model License. Commercial use is permitted under these licenses.