Model Report
Wan-AI / Wan 2.1 T2V 1.3B
Wan 2.1 T2V 1.3B is an open-source text-to-video generation model developed by Wan-AI, featuring 1.3 billion parameters and built on a Flow Matching framework with diffusion transformers. The model supports text-to-video synthesis in English and Chinese, generates 5-second 480P videos using 8.19 GB of VRAM, and can render English and Chinese text within its outputs; the broader Wan2.1 suite extends these capabilities to image-to-video conversion and video editing.
Wan 2.1 T2V 1.3B is an open-source text-to-video generative model developed as part of the Wan2.1 suite, which includes a range of AI models for video synthesis and editing. It pairs a diffusion transformer with a Flow Matching training objective to generate high-quality video efficiently from textual prompts in both English and Chinese. Designed for resource-efficient deployment, Wan 2.1 T2V 1.3B targets compatibility with consumer-grade GPUs. Its performance has been evaluated across a variety of standardized benchmarks, and it incorporates a dedicated video autoencoder for spatio-temporal compression along with the ability to render text within generated videos.
Evaluation results for Wan2.1-T2V-1.3B and other state-of-the-art models using the Wan-Bench framework, comparing a wide range of video generation metrics.
Wan 2.1 T2V 1.3B builds upon the diffusion transformer paradigm, following a Flow Matching framework for video generation. At its core lies a UMT5-based text encoder that processes multilingual prompts, with cross-attention mechanisms embedded into each transformer block for semantic integration. Diffusion time steps are injected through a multi-layer perceptron that produces modulation parameters applied within each transformer block, contributing to temporal consistency and visual coherence. This architecture is complemented by the custom Wan-VAE, a three-dimensional causal variational autoencoder responsible for spatio-temporal compression and efficient video encoding and decoding.
Architectural diagram of the diffusion-based video generation process in Wan2.1, illustrating the role of the encoder, transformer blocks, and VAE in converting textual prompts into video sequences.
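To make the Flow Matching objective concrete, the sketch below shows a minimal training step in PyTorch under common rectified-flow assumptions: latents are linearly interpolated toward Gaussian noise and the network regresses the constant velocity of that path. The timestep sampling scheme, sign conventions, and the `timestep`/`context` keyword interface are illustrative assumptions and may differ from the official Wan2.1 implementation.

```python
# Minimal flow-matching training step (illustrative sketch, not official code).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, latents, text_embeds):
    """latents: (B, C, T, H, W) Wan-VAE latents; text_embeds: UMT5 text features."""
    b = latents.shape[0]
    noise = torch.randn_like(latents)
    # Sample a continuous timestep per example in (0, 1).
    t = torch.rand(b, device=latents.device)
    t_ = t.view(b, 1, 1, 1, 1)
    # Linear interpolation between clean data (t=0) and pure noise (t=1).
    x_t = (1.0 - t_) * latents + t_ * noise
    # Constant velocity of the straight data-to-noise path.
    target_v = noise - latents
    pred_v = model(x_t, timestep=t, context=text_embeds)  # hypothetical interface
    return F.mse_loss(pred_v, target_v)
```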
The model is configured with 1.3 billion parameters, a hidden dimension of 1536, 30 transformer layers, 12 attention heads, and a feed-forward dimension of 8960. Input and output layers operate on sequences of latent video frames, allowing flexible video lengths. This design balances semantic alignment and synthesis quality against memory and compute efficiency.
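The reported hyperparameters can be collected in a small configuration object, and the sketch below illustrates how a transformer block of that size might combine self-attention, cross-attention to UMT5 text features, and timestep-conditioned modulation. The modulation layout, the assumed 4096-dimensional text features, and all layer choices are illustrative rather than a reproduction of the official code.

```python
# Sketch of one block at the reported scale (dim 1536, 12 heads, FFN 8960).
from dataclasses import dataclass
import torch
from torch import nn

@dataclass
class WanT2V13BConfig:
    dim: int = 1536          # hidden size
    num_layers: int = 30     # transformer blocks
    num_heads: int = 12      # attention heads (head dim 128)
    ffn_dim: int = 8960      # feed-forward inner dimension
    text_dim: int = 4096     # assumed UMT5 feature width

class Block(nn.Module):
    def __init__(self, cfg: WanT2V13BConfig):
        super().__init__()
        self.norm1 = nn.LayerNorm(cfg.dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(cfg.dim, cfg.num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(cfg.dim, elementwise_affine=False)
        self.text_proj = nn.Linear(cfg.text_dim, cfg.dim)
        self.cross_attn = nn.MultiheadAttention(cfg.dim, cfg.num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(cfg.dim, elementwise_affine=False)
        self.ffn = nn.Sequential(
            nn.Linear(cfg.dim, cfg.ffn_dim), nn.GELU(), nn.Linear(cfg.ffn_dim, cfg.dim)
        )
        # Timestep embedding -> per-block scale/shift/gate modulation parameters.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(cfg.dim, 6 * cfg.dim))

    def forward(self, x, text, t_emb):
        # x: (B, N, dim) video tokens; text: (B, M, text_dim); t_emb: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = (
            self.modulation(t_emb).unsqueeze(1).chunk(6, dim=-1)
        )
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.self_attn(h, h, h, need_weights=False)[0]
        ctx = self.text_proj(text)  # cross-attention to encoded prompt
        x = x + self.cross_attn(self.norm2(x), ctx, ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + scale2) + shift2
        x = x + gate2 * self.ffn(h)
        return x

cfg = WanT2V13BConfig()
block = Block(cfg)
x = torch.randn(1, 256, cfg.dim)        # 256 video latent tokens
text = torch.randn(1, 77, cfg.text_dim)
t_emb = torch.randn(1, cfg.dim)
print(block(x, text, t_emb).shape)      # torch.Size([1, 256, 1536])
```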
Training Data and Data Curation Pipeline
The training of Wan 2.1 T2V 1.3B utilized a large-scale, heterogeneous dataset comprising both high-quality videos and images, curated through a multi-stage data cleaning pipeline. The process began with deduplication and filtering on fundamental dimensions, followed by targeted quality assessments of visual, motion, and resolution characteristics. These measures ensured that only data meeting quality standards contributed to training, supporting video clarity, visual fidelity, and accurate in-video text.
Sankey diagram illustrating the multi-step data curation pipeline, from raw pool to filtered training data encompassing both image and video modalities.
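As an illustration of the staged curation described above, the following sketch chains deduplication, fundamental-dimension filters, and quality gates over hypothetical per-clip metadata. All field names, scoring signals, and thresholds are invented for clarity; the production pipeline is not published at this level of detail.

```python
# Illustrative staged curation over hypothetical clip metadata.
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    phash: str           # perceptual hash used for deduplication
    width: int
    height: int
    duration_s: float
    visual_score: float  # e.g. output of an aesthetic/quality model, in [0, 1]
    motion_score: float  # e.g. optical-flow-based motion magnitude, in [0, 1]

def curate(clips):
    seen, kept = set(), []
    for c in clips:
        if c.phash in seen:                 # stage 1: deduplication
            continue
        seen.add(c.phash)
        if min(c.width, c.height) < 480 or c.duration_s < 2.0:
            continue                        # stage 2: fundamental-dimension filters
        if c.visual_score < 0.5 or not (0.1 <= c.motion_score <= 0.9):
            continue                        # stage 3: visual and motion quality gates
        kept.append(c)
    return kept
```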
The data corpus enabled the model to learn a broad spectrum of styles, objects, actions, and scene structures. The inclusion of extensive multilingual text-video pairs also supported the model's ability to render text accurately within generated videos, a useful feature for applications requiring on-screen textual information.
Technical Innovations and Performance
The model's efficiency relies on the Wan-VAE, a novel video variational autoencoder. This 3D causal VAE architecture provides gains in spatio-temporal compression, reducing memory and compute requirements while preserving temporal structure. The VAE is capable of encoding and decoding high-resolution videos, including 1080P sequences of variable length, supporting both direct video synthesis and use as an intermediate representation for editing and transformation tasks.
Scatter plot comparing PSNR and efficiency of various video VAE architectures, highlighting the parameter efficiency of Wan-VAE relative to other open-source models.
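For a rough sense of the compression the Wan-VAE provides, the helper below computes the latent shape of a clip under assumed reduction factors (4x temporal, 8x8 spatial, 16 latent channels). These factors are stated here as assumptions for illustration rather than quoted specifications.

```python
# Back-of-the-envelope latent sizing for a 3D causal video VAE.
def latent_shape(num_frames, height, width,
                 t_stride=4, s_stride=8, z_channels=16):
    # Causal temporal compression: keep the first frame, then groups of t_stride.
    t_lat = 1 + (num_frames - 1) // t_stride
    return (z_channels, t_lat, height // s_stride, width // s_stride)

# Example: a 5-second 480P clip (81 frames at 16 fps, 832x480 pixels).
print(latent_shape(81, 480, 832))  # -> (16, 21, 60, 104)
```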
In systematic benchmarking with the Wan-Bench evaluation suite, the T2V-1.3B model was compared against comparable open-source models and proprietary solutions. Assessed metrics include motion generation, physical plausibility, image quality, scene diversity, stylization, and action following, with per-dimension results aggregated into weighted overall scores.
Comparison of evaluation scores for Wan2.1 and other leading models across a range of video synthesis criteria, including motion, quality, and scene control.
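Since Wan-Bench reports weighted scores across many dimensions, the toy function below shows how such an aggregate can be computed. The dimension names, scores, and weights are invented for illustration and do not correspond to published Wan-Bench numbers.

```python
# Toy weighted aggregation across evaluation dimensions.
def weighted_score(scores: dict, weights: dict) -> float:
    total_w = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_w

scores = {"motion": 0.72, "image_quality": 0.81, "scene_diversity": 0.66}
weights = {"motion": 2.0, "image_quality": 1.5, "scene_diversity": 1.0}
print(round(weighted_score(scores, weights), 3))  # -> 0.737
```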
The model requires 8.19 GB of VRAM to generate a 5-second 480P video, enabling operation on widely available consumer GPUs. Execution time for this generation task is approximately 231 seconds on an RTX 4090, with further acceleration achievable through optimized distributed computation.
Computational efficiency comparison for Wan2.1 models across multiple GPU types and cluster sizes, demonstrating scalability and resource-friendly deployment.
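For readers who want to reproduce a comparable 5-second 480P generation, the sketch below uses the Diffusers integration. It assumes the Diffusers-format checkpoint Wan-AI/Wan2.1-T2V-1.3B-Diffusers and a recent diffusers release that ships WanPipeline and AutoencoderKLWan; the prompt, resolution, and sampler settings are illustrative defaults rather than recommended values.

```python
# Minimal text-to-video generation sketch via diffusers (illustrative).
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"  # assumed Diffusers-format checkpoint
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

video = pipe(
    prompt="A red fox trotting through fresh snow at dawn, cinematic lighting",
    height=480,
    width=832,
    num_frames=81,        # roughly 5 seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "fox.mp4", fps=16)
```

On GPUs with less memory, enabling CPU offload (for example, `pipe.enable_model_cpu_offload()`) can reduce peak VRAM usage at the cost of generation speed.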
Wan 2.1 T2V 1.3B supports various creative and technical workflows. Its main capability is text-to-video generation, in which video content is synthesized directly from descriptive prompts. Within the broader Wan2.1 suite, companion models extend this to transforming static images into dynamic video clips (image-to-video), advanced video editing, and interpolating entire sequences from specified first and last frames (First-Last-Frame-to-Video, or FLF2V). Thanks to its multilingual training data, the model can also generate embedded English and Chinese text within videos.
The VACE pipeline enables combining text, video, mask, and image inputs for flexible video creation and editing. Wan 2.1 T2V 1.3B finds application in animation, content creation, storyboarding, educational media, and research on video-language models.
Demonstration video highlighting the text-to-video generation capabilities of Wan2.1-T2V-1.3B. [Source]
Sample of AI-generated content created using Wan2.1, demonstrating detailed scene composition and lighting effects for complex prompts.
Additional video demo illustrating diverse visual and motion capabilities of the Wan2.1 suite. [Source]
Model Variants, Benchmarks, and Limitations
The Wan2.1 suite features additional models, including T2V-14B, I2V-14B, FLF2V-14B, and VACE variants, that offer higher resolution (up to 720P), larger parameter counts, and expanded editing functionalities. T2V-1.3B is trained primarily for 480P synthesis; it can also produce 720P output, but stability at that resolution may vary because comparatively little 720P data was used in training. For FLF2V tasks, the best results are typically obtained with Chinese prompts, again reflecting the composition of the training data.
Benchmark comparison of Wan2.1 T2V 1.3B, showing its weighted scores across evaluation categories.
Results from the Win Rate GAP (T2V) benchmark, comparing Wan2.1 outputs with other major text-to-video models on visual quality and motion metrics.
Limitations include reduced stability at 720P for the 1.3B-parameter model, incomplete support for distributed inference or prompt extension in some deployment pipelines, and a preference for Chinese prompts in specialized tasks such as first-last-frame generation.
Release, Licensing, and Access
Wan2.1 T2V 1.3B is released under the Apache 2.0 License, which permits free use, modification, and distribution. Users must also comply with the accompanying usage terms, which prohibit harmful or illegal use, dissemination of personal data, and the propagation of misinformation. The model weights, codebase, and documentation are publicly available, with releases periodically updated to expand capabilities across the Wan2.1 family.
Standalone example of an AI-generated scene depicting dynamic motion and detailed environmental effects.
Further demonstration videos and model outputs are also available in the official video demo. For collaborative discussion, refer to the Discord community.