Model Report
Tencent / HunyuanVideo
HunyuanVideo is an open-source video generation model developed by Tencent that supports text-to-video, image-to-video, and controllable video synthesis. The model employs a Transformer-based architecture with a 3D Variational Autoencoder and utilizes flow matching for generating videos at variable resolutions and durations. It features 13 billion parameters and includes capabilities for avatar animation, audio synchronization, and multi-aspect ratio output generation.
HunyuanVideo is an open-source generative video foundation model developed by Tencent, designed to enable text-to-video, image-to-video, and controllable video synthesis through multimodal techniques. The model aims to close the performance gap between closed-source and open-source solutions by offering a framework for scalable, high-fidelity video generation. HunyuanVideo combines a unified Transformer architecture, a 3D variational autoencoder, and systematic data curation, and supports variable resolutions and video lengths across a wide range of visual and contextual scenarios, as detailed in its associated research.
Demonstration of HunyuanVideo’s capabilities, highlighting the model’s ability to synthesize videos that integrate virtual and real elements. [Source]
Model Architecture
HunyuanVideo is built around a large-scale Transformer-based architecture that supports both image and video generation through a dual-stream to single-stream hybrid design. Video and text tokens are first processed in separate streams, allowing focused feature extraction, before being concatenated in a unified single-stream stage that promotes multimodal information fusion. The generative process is grounded in flow matching, which learns a continuous transformation from a simple noise distribution to the complex distribution of video latents.
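The flow-matching objective can be made concrete with a short sketch. The snippet below is a minimal, illustrative PyTorch training step that assumes a linear interpolation path between Gaussian noise and clean video latents and a model that predicts the velocity along that path; the conditioning interface (`model(x_t, t, text_emb)`) and the path convention are assumptions for illustration, not the exact formulation used in HunyuanVideo's codebase.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """Minimal conditional flow-matching step for a video Transformer.

    x1: clean video latents from the 3D VAE, shape (B, C, T, H, W)
    text_emb: text-encoder features used as conditioning
    Convention assumed here: x_t = (1 - t) * x0 + t * x1, with x0 Gaussian
    noise, so t = 0 is pure noise and t = 1 is data; the model predicts the
    constant velocity (x1 - x0) along this straight path.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)               # sample from the Gaussian prior
    t = torch.rand(b, device=x1.device)     # per-sample time in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * x1         # point on the linear path
    target_v = x1 - x0                      # ground-truth velocity
    pred_v = model(x_t, t, text_emb)        # Transformer velocity prediction
    return torch.mean((pred_v - target_v) ** 2)
```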
A core component of HunyuanVideo is the 3D Variational Autoencoder (VAE), which leverages CausalConv3D layers for encoding and decoding spatiotemporal data. This mechanism compresses input pixel-space videos and images to a compact latent space, with reductions in temporal, spatial, and channel dimensions. The VAE is trained jointly on videos and images, using a composite loss that combines L1, perceptual, GAN adversarial, and KL divergence terms, thereby enabling efficient high-resolution video synthesis while maintaining temporal and visual coherence.
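To illustrate the causal-convolution idea behind the encoder and decoder, the sketch below pads only on the past side of the time axis, so each output frame depends solely on the current and earlier input frames and single images (T = 1) can be handled by the same layer. Kernel size, stride, and channel choices here are illustrative, not the VAE's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.time_pad = kernel_size - 1        # pad only before the first frame
        self.space_pad = kernel_size // 2      # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                      # x: (B, C, T, H, W)
        # F.pad order for 5-D input: (W_left, W_right, H_top, H_bottom, T_front, T_back)
        x = F.pad(x, (self.space_pad, self.space_pad,
                      self.space_pad, self.space_pad,
                      self.time_pad, 0))       # no padding after the last frame
        return self.conv(x)

# A frame at index t only sees frames <= t, so appending frames to a clip
# never changes the features already computed for earlier frames.
layer = CausalConv3d(3, 16)
video = torch.randn(1, 3, 9, 64, 64)
print(layer(video).shape)                      # torch.Size([1, 16, 9, 64, 64])
```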
Schematic of the 3D Variational Autoencoder (VAE) used in HunyuanVideo, illustrating compression and reconstruction of video streams using CausalConv3D encoders and decoders.
A distinctive design choice in HunyuanVideo is the use of a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only structure as the text encoder. This approach provides stronger image-text alignment, more detailed visual description, and reasoning during prompt interpretation. A bidirectional token refiner further augments the text encoding, and rotary position embedding (RoPE) is extended to three dimensions to model relationships across time, height, and width within video sequences.
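To make the 3-D extension of RoPE concrete, the sketch below factorizes the attention head dimension across the time, height, and width axes and builds a separate 1-D rotary table per axis; the resulting cosine/sine tables would then be applied to query and key vectors as in standard RoPE. The dimension split and base frequency are assumptions for illustration, not the model's exact settings.

```python
import torch

def rope_1d(pos, dim, base=10000.0):
    """Standard 1-D rotary tables: cos/sin of shape (len(pos), dim)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(pos.float(), freqs)        # (N, dim/2)
    angles = torch.cat([angles, angles], dim=-1)    # (N, dim)
    return angles.cos(), angles.sin()

def rope_3d(t_len, h_len, w_len, head_dim, split=(16, 24, 24)):
    """Factorized 3-D RoPE: each axis (time, height, width) gets its own
    1-D table over a slice of the head dimension; `split` is illustrative."""
    assert sum(split) == head_dim
    grids = torch.meshgrid(torch.arange(t_len), torch.arange(h_len),
                           torch.arange(w_len), indexing="ij")
    cos_parts, sin_parts = [], []
    for axis_pos, axis_dim in zip(grids, split):
        c, s = rope_1d(axis_pos.reshape(-1), axis_dim)
        cos_parts.append(c)
        sin_parts.append(s)
    # one (T*H*W, head_dim) table pair covering every video token position
    return torch.cat(cos_parts, dim=-1), torch.cat(sin_parts, dim=-1)

cos, sin = rope_3d(t_len=4, h_len=8, w_len=8, head_dim=64)
print(cos.shape)   # torch.Size([256, 64])
```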
Training Data and Methods
HunyuanVideo’s training methodology emphasizes data quality and structured annotation. The dataset is curated through a hierarchical pipeline that applies deduplication together with motion, OCR, clarity, and aesthetic filtering, yielding progressively refined video and image sets. Videos are organized into five training groups and images into two, supporting a mixed image-video training regime.
Captions for the training data are enhanced using an in-house vision language model, producing multi-faceted JSON-format structured descriptions including short and dense captions, background, style, shot type, lighting, and atmosphere. Camera motion is systematically annotated via a classifier trained to recognize 14 types of camera movements, feeding high-confidence labels into the training captions for explicit motion conditioning.
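An illustrative structured caption, written as a Python dictionary, is shown below; the field names paraphrase the categories described above and are not the exact schema used in training.

```python
# Illustrative structured caption record; the content and field names are
# examples paraphrasing the annotation categories described above.
caption = {
    "short_caption": "A surfer rides a large wave at sunset.",
    "dense_caption": (
        "A lone surfer in a black wetsuit carves along the face of a cresting "
        "wave while spray catches the warm evening light."
    ),
    "background": "open ocean under an orange-pink sky",
    "style": "cinematic, documentary",
    "shot_type": "wide tracking shot",
    "lighting": "golden hour, backlit",
    "atmosphere": "energetic, dramatic",
    "camera_movement": "pan right",  # one of the 14 annotated motion classes
}
```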
A mix of joint image-video training, progressive scaling from low to high resolution, and targeted fine-tuning on curated, high-quality subsets is used to further improve motion fidelity and visual realism. The foundation model's size of 13 billion parameters was chosen through empirical study of model scaling laws to balance quality and efficiency.
Capabilities and Applications
HunyuanVideo can synthesize realistic, cinematic-level videos from text prompts (text-to-video), extend static images into coherent video sequences (image-to-video), and generate animated avatars precisely synchronized to audio, pose, or facial-expression inputs. The architecture supports multi-resolution and multi-aspect-ratio outputs as well as varying durations, all facilitated by the underlying 3D VAE and attention mechanisms. Notably, the model can produce highly dynamic content, continuous action sequences, and deliberate camera work, generalizes to concepts beyond its training examples, and renders scenes with physical plausibility.
Generated portrait of a young woman at the beach, illustrating the model’s capability to synthesize high-quality scenes based on textual prompts.
The model features a dedicated prompt rewrite module, which interprets user queries and adapts them to optimize scene composition and camera dynamics, with both ‘normal’ and ‘master’ modes to balance semantic fidelity and visual richness. Inference efficiency is further boosted by time-step shifting and text-guidance distillation strategies. A video-to-audio (V2A) module enables synchronized sound effects and background music generation, enhancing the realism of outputs and broadening potential applications.
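Time-step shifting can be illustrated with the flow-shift mapping commonly used by rectified-flow samplers, sketched below. The specific formula and shift value are assumptions for illustration; HunyuanVideo's exact scheduler parameters may differ.

```python
def shift_sigma(sigma: float, shift: float = 7.0) -> float:
    """Map a uniform noise level sigma in [0, 1] (1 = pure noise under the
    convention assumed here) onto a shifted schedule that concentrates
    sampling steps at higher noise levels, which helps preserve quality
    when the number of inference steps is reduced."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

# Example: a 10-level schedule before and after shifting.
levels = [i / 9 for i in range(10)]
print([round(shift_sigma(s), 3) for s in levels])
```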
Demonstration of the model's ability to reproduce a variety of facial expressions for cartoon characters, illustrating controllable expression-driven animation. [Source]
Sample showcasing accurate pose-following and identity preservation in generated videos, useful for avatar animation tasks. [Source]
Video demonstrating the precise modeling of subtle facial expressions in generated characters. [Source]
Example of the model's generalizability, featuring the animation of cultural relics and artifacts. [Source]
Applications span creative content generation, animated character production, portrait and talking avatar synthesis, and video enhancement or dubbing. The model's capabilities extend to audio-driven, pose-driven, and hybrid condition-driven avatar animation, facilitating synchronized, expressive, and editable digital characters.
Evaluation and Benchmarks
HunyuanVideo has been evaluated against several closed-source and open-source video generation systems using over 1,500 standardized prompts. Professional assessors rated generated videos on text alignment, motion quality, and visual fidelity. The model achieved scores reflecting its performance in overall evaluation, motion quality, and visual quality domains, as documented in its associated research, including comparative assessments against systems such as Runway Gen-3 and Luma 1.6.
Inference is supported across variable resolutions and aspect ratios, with optimizations such as unified sequence parallelism for scalable multi-GPU deployment, efficient reduced-precision (FP8) support for memory savings, and a flexible tiling strategy for long or high-resolution videos.
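A spatial-only sketch of the tiling idea appears below: overlapping latent tiles are decoded independently and their overlaps averaged to bound peak memory. The tile sizes, the 8x spatial upsampling factor, and the flat averaging are assumptions for illustration; the actual implementation also tiles along the temporal axis and blends overlaps with smooth weights.

```python
import torch

def _starts(size, tile, step):
    """Start indices covering [0, size) with overlapping tiles."""
    starts = list(range(0, max(size - tile, 0) + 1, step))
    if starts[-1] + tile < size:
        starts.append(size - tile)
    return starts

def tiled_decode(vae_decode, latents, tile=32, overlap=8):
    """Decode large latents tile by tile to bound peak VRAM.

    latents: (B, C, T, H, W) in latent space; `vae_decode` is assumed to
    upsample H and W by a factor of 8.
    """
    B, _, _, H, W = latents.shape
    scale, step = 8, tile - overlap
    out = weight = None
    for y in _starts(H, tile, step):
        for x in _starts(W, tile, step):
            dec = vae_decode(latents[:, :, :, y:y + tile, x:x + tile])
            if out is None:
                out = torch.zeros(B, dec.shape[1], dec.shape[2], H * scale, W * scale,
                                  dtype=dec.dtype, device=dec.device)
                weight = torch.zeros_like(out)
            ys, xs = y * scale, x * scale
            out[:, :, :, ys:ys + dec.shape[-2], xs:xs + dec.shape[-1]] += dec
            weight[:, :, :, ys:ys + dec.shape[-2], xs:xs + dec.shape[-1]] += 1.0
    return out / weight                        # average the overlapping regions
```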
Limitations and Future Directions
While HunyuanVideo exhibits strong performance, several limitations are recognized. The released fast version may differ from benchmarked high-quality versions, especially regarding fidelity. The model’s master mode, designed for enhanced compositional control, may sometimes sacrifice semantic specificity in the output. Training-to-inference discrepancies in the VAE tiling strategy can produce visual artifacts, partially addressed by random tiling during fine-tuning. The video-to-audio module faces challenges in filtering and aligning diverse audio streams. The team indicates ongoing research into progressive scaling strategies from low to high resolution and further work in optimizing resource efficiency and fidelity.
Related Models
HunyuanVideo shares its ecosystem with several related models. HunyuanVideo-I2V extends capabilities toward image-to-video synthesis and is fine-tuned for specialized domains such as portrait generation. The Hunyuan-Large Model forms the foundation for prompt rewriting and advanced instruction following within the pipeline.
Licensing and Availability
The HunyuanVideo model, codebase, and pre-trained weights are available under the terms specified in the project’s LICENSE.txt file. Documentation, additional technical details, and pretrained weights can be accessed through official Tencent and Hugging Face repositories.