Model Report
DeepSeek AI / DeepSeek VL2 Tiny
DeepSeek VL2 Tiny is a vision-language model from DeepSeek AI that activates 1.0 billion parameters through a Mixture-of-Experts architecture. The model combines a SigLIP vision encoder with a DeepSeekMoE-based language component to handle multimodal tasks including visual question answering, optical character recognition, document analysis, and visual grounding across images and text.
DeepSeek-VL2-Tiny is a vision-language model developed by DeepSeek AI as part of the DeepSeek-VL2 series of Mixture-of-Experts (MoE) models. Released in December 2024 as the smallest of the series' three variants, alongside DeepSeek-VL2-Small and DeepSeek-VL2, it integrates visual and language understanding in an efficient, scalable architecture and supports applications including visual question answering, optical character recognition, document analysis, and visual grounding.
Figure: performance and scalability of the DeepSeek-VL2 family, including DeepSeek-VL2-Tiny, comparing average benchmark scores against activated parameters for several leading vision-language models.
Model Architecture

DeepSeek-VL2-Tiny uses a modular architecture that combines a vision encoder, a vision-language adaptor, and a Mixture-of-Experts language model. The vision encoder, SigLIP-SO400M-384, processes input images through a dynamic tiling strategy that handles varying sizes and resolutions: high-resolution images are segmented into local tiles (typically 384×384 pixels) plus a global thumbnail tile, so the model extracts both fine-grained detail and overall context. The encoded visual tokens are then compressed in two steps: a 2×2 pixel shuffle reduces the number of tokens per tile, and a multilayer perceptron projects the result into the language model's embedding space.
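To make the adaptor step concrete, here is a minimal PyTorch sketch of a 2×2 pixel shuffle followed by an MLP projection, in the spirit of the compression described above; the grid size, hidden widths, and module structure are illustrative assumptions rather than DeepSeek-VL2-Tiny's actual implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageAdaptor(nn.Module):
    """Illustrative 2x2 pixel-shuffle compression followed by an MLP projector.

    Grid size and hidden widths are placeholders, not DeepSeek-VL2-Tiny's real config.
    """

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1280):
        super().__init__()
        # After the 2x2 shuffle, four neighboring tokens are concatenated
        # along the channel axis, so the projector sees 4 * vision_dim features.
        self.proj = nn.Sequential(
            nn.Linear(4 * vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tile_tokens: torch.Tensor) -> torch.Tensor:
        # tile_tokens: (batch, h, w, c) grid of vision-encoder outputs for one tile.
        b, h, w, c = tile_tokens.shape
        assert h % 2 == 0 and w % 2 == 0, "token grid must be divisible by the 2x2 shuffle"
        # Group each 2x2 neighborhood into a single token with 4x the channels.
        x = tile_tokens.reshape(b, h // 2, 2, w // 2, 2, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        # Project into the language model's embedding space.
        return self.proj(x)

# Example: a 28x28 token grid per tile becomes 14x14 projected tokens.
tokens = torch.randn(1, 28, 28, 1152)
print(VisionLanguageAdaptor()(tokens).shape)  # torch.Size([1, 196, 1280])
```

The shuffle trades spatial resolution for channel depth, cutting the number of visual tokens per tile by a factor of four before they reach the language model.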
The language component of DeepSeek-VL2-Tiny builds on DeepSeekMoE and incorporates multi-head latent attention, which compresses the key-value cache into latent vectors to improve inference efficiency. The MoE design keeps computation sparse during training and inference, with a global bias term used to balance computational load across experts. A LLaVA-style decoder-only architecture lets visual and text tokens be processed jointly in a single sequence.
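The sketch below illustrates bias-adjusted top-k expert routing of the kind the load-balancing description above refers to; the number of experts, the top-k value, the sigmoid affinity function, and the dimensions are assumptions for illustration, not DeepSeek-VL2-Tiny's published configuration.

```python
import torch
import torch.nn as nn

class BiasBalancedRouter(nn.Module):
    """Simplified top-k MoE router with a per-expert bias used for load balancing.

    The bias shifts which experts are *selected*, while the combine weights come
    from the unbiased affinities. All sizes here are illustrative only.
    """

    def __init__(self, dim: int = 1280, n_experts: int = 64, top_k: int = 6):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)
        # Non-trainable per-expert bias, nudged up or down according to expert load.
        self.register_buffer("load_bias", torch.zeros(n_experts))

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim)
        affinity = torch.sigmoid(self.gate(x))                       # unbiased expert affinities
        _, expert_idx = (affinity + self.load_bias).topk(self.top_k, dim=-1)
        weights = affinity.gather(-1, expert_idx)                    # combine weights from raw scores
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return expert_idx, weights

router = BiasBalancedRouter()
idx, w = router(torch.randn(4, 1280))
print(idx.shape, w.shape)  # torch.Size([4, 6]) torch.Size([4, 6])
```

Because only the selected experts run for each token, the per-token compute cost depends on top_k rather than on the total number of experts.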
Parameterization and Variant Positioning
Within the DeepSeek-VL2 family, DeepSeek-VL2-Tiny activates 1.0 billion parameters, with its base language model activating 0.57 billion out of a total of 3 billion parameters. This makes it the most parameter-efficient member of the series: DeepSeek-VL2-Small and the standard DeepSeek-VL2 activate more parameters, carry larger totals, and scale up both performance and computational requirements accordingly. All members of the DeepSeek-VL2 series use a 4096-token sequence length, supporting complex multimodal interactions.
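To see why activated and total parameter counts diverge in an MoE model, the back-of-the-envelope sketch below uses hypothetical layer counts and expert sizes (not DeepSeek-VL2-Tiny's real configuration): only the routed experts contribute to the per-token activated count.

```python
# Toy illustration of activated vs. total parameters in an MoE model.
# All numbers below are hypothetical, not DeepSeek-VL2-Tiny's real configuration.
n_experts = 64                # experts available in each MoE layer
experts_per_token = 6         # experts actually routed to for a given token
params_per_expert = 3_000_000
shared_params = 400_000_000   # attention, embeddings, shared experts, etc.
n_moe_layers = 12

total = shared_params + n_moe_layers * n_experts * params_per_expert
activated = shared_params + n_moe_layers * experts_per_token * params_per_expert
print(f"total: {total / 1e9:.2f}B, activated per token: {activated / 1e9:.2f}B")
# -> total: 2.70B, activated per token: 0.62B
```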
Benchmark Performance
DeepSeek-VL2-Tiny achieves competitive results on a range of publicly recognized benchmarks relative to its compact size. In optical character recognition, the model reaches 809 on OCRBench, 88.9 on DocVQA, 81.0 on ChartQA, 66.1 on InfoVQA, and 80.7 on TextVQA. For general question answering and mathematics, it achieves scores including 45.9 on MMStar, 71.6 on AI2D, and 53.6 on MathVista (testmini), among others. On visual grounding tasks, DeepSeek-VL2-Tiny attains 84.7 (val) and up to 87.8 (testA) on RefCOCO, alongside strong results on RefCOCO+ and RefCOCOg. Collectively, these outcomes place the model favorably against other open-source dense and MoE-based vision-language models at comparable scales, as detailed in the technical report.
Training Data and Methods
Training of DeepSeek-VL2-Tiny follows a three-stage process. An initial vision-language alignment phase trains the vision-language adaptor, relying on datasets such as ShareGPT4V with approximately 1.2 million caption and conversation samples. The main pretraining stage mixes vision-language and text-only data drawn from major datasets such as WIT, WikiHow, and OBELICS, supplemented with in-house and Chinese content. Dedicated subsets address OCR, visual question answering, document and chart understanding, and visual grounding, combining open-source resources such as LaTeX OCR, 12M RenderedText, PubTabNet, and FinTabNet with extensive proprietary datasets covering both English and Chinese text.
The third stage, supervised fine-tuning, emphasizes instruction following and conversational responsiveness by combining in-house data with curated multimodal supervision. Data curation includes cleaning, regenerating, and quality-controlling responses from public datasets, as well as constructing specialized Chinese QA datasets to round out language coverage. Throughout, training uses tensor and expert parallelism on the HAI-LLM platform to manage the GPU workloads required for large-scale multimodal data.
Applications and Model Scope
DeepSeek-VL2-Tiny supports a broad array of multimodal use cases. Its primary applications are visual question answering, OCR, document, table, and chart analysis, and visual grounding, tasks central to document intelligence, knowledge extraction, and perception in artificial agents. Beyond these, the model can handle chatbot interactions involving graphical user interfaces, meme interpretation, multi-image dialogue, and visual storytelling. It is also designed to be adaptable to embodied AI and agentic applications that rely on visual grounding to understand their environment. As with most vision-language models of this size, it has limitations with extremely blurry images, rare objects, and complex creative storytelling formats.
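As a usage illustration, the sketch below follows the visual question-answering pattern documented in the official DeepSeek-VL2 GitHub repository; the deepseek_vl2 package, the class and helper names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds, model.language.generate), and the prompt format are assumptions taken from that README and should be verified against the repository before use.

```python
import torch
from transformers import AutoModelForCausalLM
# The deepseek_vl2 package ships with the official DeepSeek-VL2 GitHub repository;
# the class, helper, and attribute names below follow its README and may differ
# across versions, so check the repo before relying on them.
from deepseek_vl2.models import DeepseekVLV2Processor
from deepseek_vl2.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl2-tiny"
processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()  # requires a CUDA GPU with enough VRAM

# Single-image visual question answering; the image path is a placeholder.
conversation = [
    {"role": "<|User|>", "content": "<image>\nWhat is the total on this receipt?",
     "images": ["./receipt.png"]},
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images,
                   force_batchify=True, system_prompt="").to(model.device)

# Fuse image embeddings with the text prompt, then generate a reply.
inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=256,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```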
Licensing and Accessibility
The DeepSeek-VL2 code is distributed under the MIT License, while the model weights, including DeepSeek-VL2-Tiny, are governed by the DeepSeek Model License. These terms allow commercial use and support transparency for both research and application development.