The simplest way to self-host Wan 2.1 T2V 1.3B. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Wan 2.1 T2V 1.3B is a text-to-video model that uses Flow Matching and a custom 3D causal VAE architecture. The 1.3B-parameter model generates 480P videos, with experimental 720P support. It processes multilingual text with a T5 encoder and can create a 5-second video in around 4 minutes on an RTX 4090.
Wan 2.1 T2V 1.3B is the compact text-to-video model in the Wan 2.1 family. It offers high-quality video generation while keeping hardware requirements within reach of consumer GPUs, showing that an efficient architecture can deliver strong results at a modest parameter count.
The architecture combines three main components: a Diffusion Transformer backbone, a T5 text encoder, and a video VAE. At its core, Wan 2.1 uses the Flow Matching framework within the Diffusion Transformers paradigm. The T5 encoder handles multilingual text input, and cross-attention in each transformer block injects the resulting text embeddings into the model.
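To make the Flow Matching idea concrete, here is a minimal sketch in plain Python. It assumes the common straight-path (rectified-flow style) formulation, with noise at t=0 and data at t=1; the source does not state which variant or time convention Wan 2.1 uses, so treat this as an illustration rather than the model's training code. The names `interpolate`, `target_velocity`, `euler_sample`, and `oracle` are hypothetical.

```python
import random

def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Regression target for the network: the constant velocity along that path."""
    return [d - n for n, d in zip(noise, data)]

def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
data = [0.5, -1.0, 2.0]                       # toy "video latent"
noise = [random.gauss(0.0, 1.0) for _ in data]

# Stand-in for a trained network (assumption): the exact velocity of this one path.
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(oracle, noise, steps=10)
```

In a real model the oracle is replaced by the transformer, which is trained to predict the velocity from the interpolated point, the time step, and the text embeddings.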
One of the most notable technical innovations is the Wan-VAE, a novel 3D causal VAE architecture designed specifically for video generation. It compresses video efficiently in both space and time while preserving temporal causality, which allows the VAE to encode and decode 1080P videos of unlimited length.
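The link between temporal causality and long videos can be sketched with a toy one-dimensional convolution over time. This is not the Wan-VAE implementation: it assumes zero padding on the past side only, and it assumes that a small cache of recent frames is carried between chunks. The functions `causal_temporal_conv` and `causal_conv_streaming` are hypothetical names.

```python
def causal_temporal_conv(frames, kernel):
    """Convolve over time so that output[t] depends only on frames[t-k+1 .. t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)   # pad the past, never the future
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(frames))
    ]

def causal_conv_streaming(chunks, kernel):
    """Same result as one full pass, but processed chunk by chunk with a small cache."""
    k = len(kernel)
    cache = [0.0] * (k - 1)                   # last k-1 frames seen so far
    out = []
    for chunk in chunks:
        padded = cache + list(chunk)
        out.extend(
            sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(chunk))
        )
        cache = padded[len(padded) - (k - 1):]
    return out

kernel = [0.25, 0.25, 0.5]
video = [1.0, 2.0, 3.0, 4.0, 5.0]             # one value per frame, for brevity

full = causal_temporal_conv(video, kernel)
streamed = causal_conv_streaming([[1.0, 2.0], [3.0], [4.0, 5.0]], kernel)

# Changing future frames leaves earlier outputs untouched.
edited = causal_temporal_conv([1.0, 2.0, 3.0, 9.0, 9.0], kernel)
```

Because each output only looks backwards, memory use in the streaming version depends on the chunk size and kernel length, not on the total number of frames.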
The model's training process involved a carefully curated and deduplicated dataset of images and videos. The data preparation followed a rigorous four-step cleaning process that focused on fundamental dimensions, visual quality, and motion quality.
The T2V-1.3B model performs well for its size. It requires only 8.19 GB of VRAM, which makes it accessible to users with consumer-grade GPUs. On an RTX 4090, it can generate a 5-second 480P video in approximately 4 minutes without optimization techniques such as quantization.
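The figures above translate into two quick checks: whether a given GPU has enough memory, and how much compute time each second of video costs. The sketch below uses only the numbers quoted in this section; the 1 GB headroom for other processes is an assumption, and the helper names are hypothetical.

```python
REQUIRED_VRAM_GB = 8.19     # quoted requirement for T2V-1.3B
CLIP_SECONDS = 5            # length of the generated clip
GENERATION_MINUTES = 4      # approximate time on an RTX 4090 at 480P

def fits_in_vram(gpu_vram_gb, required_gb=REQUIRED_VRAM_GB, headroom_gb=1.0):
    """True if the GPU has the required VRAM plus some headroom (assumed 1 GB)."""
    return gpu_vram_gb >= required_gb + headroom_gb

def compute_seconds_per_video_second(clip_seconds=CLIP_SECONDS,
                                     minutes=GENERATION_MINUTES):
    """Seconds of generation time spent per second of output video."""
    return minutes * 60 / clip_seconds

rtx_4090_ok = fits_in_vram(24.0)    # RTX 4090 ships with 24 GB
small_gpu_ok = fits_in_vram(8.0)    # an 8 GB card falls just short
ratio = compute_seconds_per_video_second()
```

With these inputs, generation runs at roughly 48 seconds of compute per second of video on the RTX 4090.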
Computational efficiency across different hardware configurations has been thoroughly documented:
The Wan 2.1 family includes several models with different capabilities and requirements:
Although smaller than its 14B counterparts, the T2V-1.3B model remains competitive and requires significantly fewer computational resources. This makes it a good choice for users who need to balance output quality against hardware constraints.