The simplest way to self-host HunyuanVideo. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
HunyuanVideo is Tencent's text-to-video model featuring a unique dual-to-single-stream Transformer architecture. It uses a multimodal LLM decoder for text processing and a 3D VAE with CausalConv3D for efficient video compression. The model excels at maintaining consistent motion across generated video frames.
HunyuanVideo is a large open-source video generation model developed by Tencent and released in December 2024. It represents a significant advance in text-to-video generation, with capabilities that, according to its technical report, are comparable or even superior to those of leading closed-source models.
HunyuanVideo employs a sophisticated "Dual-stream to Single-stream" hybrid Transformer design that processes video and text tokens independently before concatenating them for multimodal fusion. This architecture leverages a full attention mechanism for unified image and video generation.
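As a rough illustration of that layout, the minimal PyTorch sketch below (hypothetical module and dimension choices, not the actual HunyuanVideo code) runs video and text tokens through separate transformer layers and then applies full attention over their concatenation.

```python
import torch
import torch.nn as nn

class DualToSingleStreamBlock(nn.Module):
    """Conceptual sketch: per-modality blocks first, then joint full attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Dual-stream stage: each modality is processed independently.
        self.video_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.text_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Single-stream stage: full attention over the concatenated sequence.
        self.fusion_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        v = self.video_block(video_tokens)   # (B, Nv, D)
        t = self.text_block(text_tokens)     # (B, Nt, D)
        fused = torch.cat([v, t], dim=1)     # (B, Nv + Nt, D)
        return self.fusion_block(fused)      # multimodal fusion via full attention

# Example: 16 video latent tokens and 8 text tokens in a 512-dim space.
block = DualToSingleStreamBlock()
out = block(torch.randn(2, 16, 512), torch.randn(2, 8, 512))
print(out.shape)  # torch.Size([2, 24, 512])
```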
The model's text encoder utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only structure, chosen for its superior capabilities in image-text alignment, detail description, and complex reasoning compared to traditional approaches like CLIP and T5-XXL.
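The sketch below shows the general pattern, using a placeholder checkpoint name (the actual MLLM and its prompt template are defined in the HunyuanVideo code): instead of a pooled CLIP or T5 embedding, the last-layer hidden states of a decoder-only language model are used as the per-token text conditioning sequence.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "some-decoder-only-mllm" is a placeholder, not a real checkpoint name.
name = "some-decoder-only-mllm"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "A cat walks on the grass, realistic style."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last-layer hidden states serve as the text conditioning sequence.
text_embeddings = outputs.hidden_states[-1]  # (1, seq_len, hidden_dim)
```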
A key component is the 3D Variational Autoencoder (VAE) with CausalConv3D, which compresses videos and images into a latent space, significantly reducing the computational requirements for the diffusion transformer. The model also incorporates a prompt rewrite component, fine-tuned using Tencent's Hunyuan-Large model, which enhances generation quality by optimizing user prompts.
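A minimal sketch of the causal-convolution idea follows (a simplified layer, not the model's actual VAE code): all temporal padding is applied on the past side, so each output frame depends only on the current and earlier frames, a property commonly used to let one VAE treat single images as one-frame videos.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Sketch of a causal 3D convolution: pad only toward the past on the
    time axis so each output frame sees current and earlier frames only."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.time_pad = kernel - 1      # all temporal padding goes to the "past"
        self.space_pad = kernel // 2    # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); F.pad order is (W_l, W_r, H_t, H_b, T_front, T_back)
        x = F.pad(x, (self.space_pad, self.space_pad,
                      self.space_pad, self.space_pad,
                      self.time_pad, 0))
        return self.conv(x)

# A 9-frame, 64x64 RGB clip keeps its temporal length after the causal conv.
clip = torch.randn(1, 3, 9, 64, 64)
print(CausalConv3d(3, 8)(clip).shape)  # torch.Size([1, 8, 9, 64, 64])
```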
Human evaluations demonstrate that HunyuanVideo outperforms several leading closed-source models, including Runway Gen-3 and Luma 1.6, particularly in motion quality. The model supports various video resolutions (540p and 720p) and frame rates, with two operational modes: Normal and Master, each optimized for different prompt styles.
The model excels in generating high-quality videos with natural-looking movement and precise expression, making it suitable for content creation in advertising and film industries. It demonstrates particular strength in maintaining temporal consistency and handling complex motion sequences.
For inference, HunyuanVideo requires an NVIDIA GPU with CUDA support. The minimum GPU memory is roughly 45 GB for 544x960 generation and 60 GB for 720p (720x1280) generation at the full 129-frame length, with 80 GB recommended for better generation quality.
The model supports parallel processing through the xDiT framework, enabling efficient use of multiple GPUs and leveraging Unified Sequence Parallelism (USP) APIs. FP8 quantized weights are available, reducing VRAM requirements by approximately 10GB.
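As a rough sanity check on that figure (assuming the quantization targets the roughly 13-billion-parameter diffusion transformer described in the technical report), halving weight precision from 16-bit to 8-bit frees memory on the order of 10 GB:

```python
# Back-of-the-envelope weight-memory estimate (illustrative only; the ~13B
# parameter count of the diffusion transformer is an assumption here).
params = 13e9
bf16_gb = params * 2 / 1e9   # 16-bit weights: ~26 GB
fp8_gb = params * 1 / 1e9    # 8-bit weights:  ~13 GB
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB, saving ~{bf16_gb - fp8_gb:.0f} GB")
```

The reported ~10 GB saving is in the same ballpark; some gap is expected, since not every tensor is quantized and activation memory is unaffected.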
Inference can be performed via command line interface or through a Gradio server, with various parameters available for customizing video size, length, generation steps, and guidance scales. The project includes comprehensive installation instructions and a pre-built Docker image for enhanced accessibility.
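As an example of how those parameters are typically exposed, the sketch below uses the community Diffusers integration (assuming a recent diffusers release with HunyuanVideo support and the hunyuanvideo-community/HunyuanVideo mirror on Hugging Face); the official CLI and Gradio server expose equivalent settings for size, length, steps, and guidance.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Community mirror of the official weights on Hugging Face.
model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()  # reduces VRAM use during the 3D VAE decode
pipe.to("cuda")

video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=544,              # video size (smaller sizes lower memory use)
    width=960,
    num_frames=129,          # video length
    num_inference_steps=50,  # generation steps
    guidance_scale=6.0,      # guidance scale
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```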
The entire codebase, including model weights, inference code, and associated materials, is publicly available on GitHub and Hugging Face. The community has already contributed several expansions, including integrations with platforms like ComfyUI and specialized models for improved video quality.