Browse Models
Note: Qwen2.5 VL 72B weights are released under the Qwen Research License and cannot be used for commercial purposes. Please read the license to confirm whether your use case is permitted.
The simplest way to self-host Qwen2.5 VL 72B. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Qwen2.5-VL 72B is a 73.4B-parameter vision-language model that processes images, videos, and documents. It features a dynamic-resolution ViT, Window Attention, and Multimodal RoPE for efficient multimodal processing. Trained on 18T tokens, it excels at tasks such as chart interpretation and analysis of videos up to an hour long.
Qwen2.5-VL 72B represents a significant advancement in vision-language modeling, building upon its predecessor Qwen2-VL. Released in January 2025, it demonstrates enhanced capabilities across various visual understanding tasks while introducing architectural improvements for better efficiency.
The model features a streamlined architecture with several key innovations. At its core is a novel dynamic-resolution Vision Transformer (ViT) visual encoder, built for efficiency through Window Attention (full attention is retained in only four layers) and aligned architecturally with LLMs through RMSNorm and SwiGLU. The video encoder implements dynamic FPS sampling, extending dynamic resolution to the temporal dimension so the model can process videos at varying frame rates and better capture temporal sequences.
The architecture incorporates Multimodal Rotary Position Embedding (M-RoPE) for improved fusion of positional information across modalities. The weights are distributed in Safetensors format, with 73.4B parameters stored in BF16. The official blog post details how these architectural choices contribute to both training and inference efficiency.
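As a rough sanity check on hardware requirements, a back-of-the-envelope calculation from the stated 73.4B BF16 parameters gives the approximate weight footprint (this estimate is illustrative and excludes activations and KV cache):

# Approximate weight footprint for Qwen2.5-VL 72B:
# 73.4B parameters x 2 bytes per BF16 value.
params = 73.4e9
bytes_per_param = 2  # BF16 is 16 bits wide

weight_bytes = params * bytes_per_param
print(f"~{weight_bytes / 1e9:.0f} GB of weights")  # roughly 147 GB, before activations or KV cache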
Qwen2.5-VL 72B demonstrates strong capabilities across multiple domains, from document and chart understanding to long-form video analysis and agent tasks.
The model performs well across numerous benchmarks, often outperforming its predecessor Qwen2-VL and competing with leading models such as GPT-4o and Claude 3.5 Sonnet. Notable benchmark results span image understanding (MMMU, MathVista, DocVQA), video comprehension (VideoMME, MMBench-Video), and agent tasks (ScreenSpot, Android Control).
The Qwen2.5-VL family includes three parameter sizes: 3B, 7B, and 72B.
All variants were trained on approximately 18 trillion tokens, a significant increase from the 7 trillion tokens used in earlier versions. This expanded training contributes to improved reasoning, common sense, and expertise across the model family.
The model is integrated into the Hugging Face Transformers library, which can be installed from source alongside accelerate using:
pip install git+https://github.com/huggingface/transformers accelerate
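A quick way to confirm that the installed build actually includes Qwen2.5-VL support is to import its model class (a simple sanity-check snippet, not an official verification step):

# Older Transformers releases predate Qwen2.5-VL; the import below fails there.
from transformers import Qwen2_5_VLForConditionalGeneration  # noqa: F401

print("Qwen2.5-VL support is available")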
For handling various visual inputs, the qwen-vl-utils package is recommended:
pip install qwen-vl-utils[decord]
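With both packages in place, inference follows the standard Transformers chat-template pattern. The sketch below assumes the Qwen/Qwen2.5-VL-72B-Instruct checkpoint and uses a placeholder image path; adapt both to your setup:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-72B-Instruct"

# Load weights in BF16 and shard them across the available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single-image chat message; "path/to/image.jpg" is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the prompt and collect the visual inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens from the output.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])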
The model supports YaRN for efficient context-window extension, though this may affect localization tasks. For long videos, increasing max_position_embeddings is recommended.
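A sketch of how such settings might be applied programmatically is shown below. The specific YaRN values and the 64k position limit are illustrative assumptions rather than official recommendations, and depending on the Transformers version these fields may live on config.text_config instead of the top-level config:

from transformers import AutoConfig, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-72B-Instruct"
config = AutoConfig.from_pretrained(model_id)

# Illustrative YaRN settings (assumed values; check the model card for the
# recommended factor and mrope_section for your target context length).
config.rope_scaling = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
}

# For long videos, allow a larger positional range.
config.max_position_embeddings = 64000

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)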