The simplest way to self-host DeepSeek VL2 Tiny. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
DeepSeek VL2 Tiny is a vision-language model with 1.0B activated parameters, built on a Mixture-of-Experts architecture with SigLIP vision encoding and Multi-head Latent Attention. It excels at OCR tasks, supports a 4096-token sequence length, and uses dynamic tiling for image processing. It was trained on diverse datasets for visual Q&A and image analysis.
DeepSeek VL2 Tiny is a Mixture-of-Experts (MoE) Vision-Language Model that represents the most compact variant in the DeepSeek-VL2 model family. With 1.0B activated parameters, it offers an efficient solution for multimodal understanding tasks while maintaining competitive performance. The model is part of a family that includes DeepSeek VL2-Small (2.8B activated parameters) and the full DeepSeek-VL2 model (4.5B activated parameters), as detailed in the research paper.
The model's architecture combines a SigLIP-SO400M-384 vision encoder with a DeepSeekMoE language model, utilizing a Multi-head Latent Attention (MLA) mechanism for efficient inference. A notable feature is its dynamic tiling vision encoding strategy, which enables processing of high-resolution images with varying aspect ratios by dividing them into manageable tiles. For multiple image processing, the model employs different strategies: up to two images use dynamic tiling, while three or more images are padded to 384x384 without tiling. The model supports a sequence length of 4096, making it suitable for complex multimodal tasks.
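To make the dynamic tiling idea concrete, the sketch below splits a single image into 384x384 tiles on a grid chosen to match its aspect ratio, plus a global thumbnail. This is only an illustration of the strategy described above, not the model's actual preprocessing; the official processor handles this internally, and the tile_image helper and max_tiles parameter here are hypothetical.

```python
from PIL import Image

TILE = 384  # base resolution of the SigLIP-SO400M-384 vision encoder

def tile_image(img: Image.Image, max_tiles: int = 9):
    """Split an image into 384x384 local tiles plus a global thumbnail.

    Illustrative only: the grid search and global view mimic the dynamic
    tiling strategy at a high level; the real preprocessing lives in the
    official DeepSeek-VL2 processor.
    """
    w, h = img.size
    aspect = w / h

    # Pick the (cols, rows) grid whose aspect ratio best matches the image,
    # subject to a cap on the total number of tiles.
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))

    # Resize to an exact multiple of the tile size, then crop out each tile.
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]

    # Keep a downsampled global view so the model also sees the whole image.
    return [img.resize((TILE, TILE))] + tiles
```

With three or more input images, as noted above, this tiling step is skipped and each image is simply padded to 384x384.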
DeepSeek VL2 Tiny was trained on a diverse vision-language dataset through multiple stages, including VL alignment, VL pretraining, and supervised fine-tuning. The training data encompasses sources such as ShareGPT4V, WIT, WikiHow, OBELICS, and Wanjuan, along with specialized datasets for image captioning, OCR, visual question answering, and visual grounding. This comprehensive training approach enables the model to excel at those same tasks: image captioning, OCR, visual question answering, and visual grounding.
The model demonstrates competitive performance compared to other open-source models of similar size, particularly in OCR-related tasks and visual grounding. For optimal generation quality, a sampling temperature (T) ≤ 0.7 is recommended. The implementation requires Python 3.8 or higher and uses the transformers library, as detailed in the GitHub repository.
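As a starting point, the snippet below follows the usage pattern shown in the GitHub repository's README: the model is loaded through transformers with trust_remote_code, while the processor and image-loading helper come from the repository's deepseek_vl2 package. Treat it as a hedged sketch; the class and helper names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds), the example image path, and the generation settings should be checked against the current repository before use.

```python
import torch
from transformers import AutoModelForCausalLM

# These imports come from the official GitHub repository's package; verify
# the exact names against its README for the version you install.
from deepseek_vl2.models import DeepseekVLV2Processor
from deepseek_vl2.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl2-tiny"
processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# A single-image visual question; "./example.jpg" is a placeholder path.
conversation = [
    {"role": "<|User|>", "content": "<image>\nDescribe this image.", "images": ["./example.jpg"]},
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt="",
).to(model.device)

# The exact generation entry point may differ between repository versions;
# check the repository's inference example if this call does not match.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,  # keep T <= 0.7, per the recommendation above
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```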
When comparing the variants within the DeepSeek-VL2 family, performance generally improves with model size, though DeepSeek VL2 Tiny maintains competitive capabilities while requiring significantly fewer computational resources than its larger siblings. This makes it an attractive option for applications where resource efficiency is a priority.
The model was released on December 13, 2024, with additional features including a Gradio demo and incremental prefilling support added on December 25, 2024. The code is available under the MIT License, while the model itself is governed by the DeepSeek Model License. Commercial use is supported under these licensing terms.