The simplest way to self-host DeepSeek VL2 Tiny. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek VL2 Tiny is a vision-language model with 1.0B activated parameters, built on a Mixture-of-Experts architecture with SigLIP vision encoding and Multi-head Latent Attention. It excels at OCR tasks, supports a 4096-token sequence length, and uses dynamic tiling for image processing. It was trained on diverse datasets for visual Q&A and image analysis.
DeepSeek VL2 Tiny is a Mixture-of-Experts (MoE) Vision-Language Model that represents the most compact variant in the DeepSeek-VL2 model family. With 1.0B activated parameters, it offers an efficient solution for multimodal understanding tasks while maintaining competitive performance. The model is part of a family that includes DeepSeek VL2-Small (2.8B activated parameters) and the full DeepSeek-VL2 model (4.5B activated parameters), as detailed in the research paper.
The model's architecture combines a SigLIP-SO400M-384 vision encoder with a DeepSeekMoE language model, utilizing a Multi-head Latent Attention (MLA) mechanism for efficient inference. A notable feature is its dynamic tiling vision encoding strategy, which enables processing of high-resolution images with varying aspect ratios by dividing them into manageable tiles. For multiple image processing, the model employs different strategies: up to two images use dynamic tiling, while three or more images are padded to 384x384 without tiling. The model supports a sequence length of 4096, making it suitable for complex multimodal tasks.
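To make the dynamic tiling idea concrete, the sketch below splits a single image into 384x384 tiles on a grid chosen to match its aspect ratio, plus a global thumbnail. This is only an illustration of the strategy described above, not the model's actual preprocessing; the official processor handles this internally, and the tile_image helper and max_tiles parameter here are hypothetical.

```python
from PIL import Image

TILE = 384  # base resolution of the SigLIP-SO400M-384 vision encoder

def tile_image(img: Image.Image, max_tiles: int = 9):
    """Split an image into 384x384 local tiles plus a global thumbnail.

    Illustrative only: the grid search and global view mimic the dynamic
    tiling strategy at a high level; the real preprocessing lives in the
    official DeepSeek-VL2 processor.
    """
    w, h = img.size
    aspect = w / h

    # Pick the (cols, rows) grid whose aspect ratio best matches the image,
    # subject to a cap on the total number of tiles.
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))

    # Resize to an exact multiple of the tile size, then crop out each tile.
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

    # Keep a downsampled global view so the model also sees the whole image.
    return [img.resize((TILE, TILE))] + tiles
```

With three or more input images, as noted above, this tiling step is skipped and each image is simply padded to 384x384.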
DeepSeek VL2 Tiny was trained on a diverse vision-language dataset through multiple stages, including VL alignment, VL pretraining, and supervised fine-tuning. The training data encompasses sources such as ShareGPT4V, WIT, WikiHow, OBELICS, and Wanjuan, along with specialized datasets for image captioning, OCR, visual question answering, and visual grounding. This comprehensive training approach enables the model to excel at those same tasks: image captioning, OCR, visual question answering, and visual grounding.
The model demonstrates competitive performance compared to other open-source models of similar size, particularly in OCR-related tasks and visual grounding. For optimal generation quality, a sampling temperature (T) ≤ 0.7 is recommended. The implementation requires Python 3.8 or higher and uses the transformers library, as detailed in the GitHub repository.
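As a starting point, the snippet below follows the usage pattern shown in the GitHub repository's README: the model is loaded through transformers with trust_remote_code, while the processor and image-loading helper come from the repository's deepseek_vl2 package. Treat it as a hedged sketch; the class and helper names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds), the example image path, and the generation settings should be checked against the current repository before use.

```python
import torch
from transformers import AutoModelForCausalLM

# These imports come from the official GitHub repository's package; verify
# the exact names against its README for the version you install.
from deepseek_vl2.models import DeepseekVLV2Processor
from deepseek_vl2.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl2-tiny"
processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# A single-image visual question; "./example.jpg" is a placeholder path.
conversation = [
    {"role": "<|User|>", "content": "<image>\nDescribe this image.", "images": ["./example.jpg"]},
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt="",
).to(model.device)

# The exact generation entry point may differ between repository versions;
# check the repository's inference example if this call does not match.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,  # keep T <= 0.7, per the recommendation above
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```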
When comparing the variants within the DeepSeek-VL2 family, performance generally improves with model size, though DeepSeek VL2 Tiny maintains competitive capabilities while requiring significantly fewer computational resources than its larger siblings. This makes it an attractive option for applications where resource efficiency is a priority.
The model was released on December 13, 2024, with additional features including a Gradio demo and incremental prefilling support added on December 25, 2024. The code is available under the MIT License, while the model itself is governed by the DeepSeek Model License. Commercial use is supported under these licensing terms.