Browse Models
Note: CogVideoX 5B weights are released under the CogVideoX License and cannot be used for commercial purposes. Please read the license to verify that your use case is permitted.
The simplest way to self-host CogVideoX 5B. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
CogVideoX-5B generates 10-second videos at 16fps with 768x1360 resolution from text or image inputs. Built with a 3D Variational Autoencoder and expert transformer architecture, it was trained on 35M video clips and 2B images. Notable for its spatial-temporal understanding and continuous motion handling through 3D attention mechanisms.
CogVideoX-5B is a large-scale text-to-video generation model based on a diffusion transformer architecture, released on August 27, 2024. The model represents a significant advancement in video generation capabilities, building upon the earlier CogVideo model released in May 2022.
The architecture incorporates several key components that enable high-quality video generation:
The model was trained on approximately 35 million single-shot video clips averaging 6 seconds each, combined with 2 billion filtered images. A sophisticated video captioning pipeline was developed, combining a pre-trained video captioning model, CogVLM for image recaptioning, and GPT-4 for summarization, with a fine-tuned Llama 2 model used to accelerate captioning. Training used BF16 precision and a progressive training pipeline with multi-resolution frame packing.
CogVideoX-5B can generate:
The model demonstrates state-of-the-art performance based on both automated metrics and human evaluations, particularly in areas such as Human Action, Scene, Dynamic Degree, Multiple Objects, Appearance Style, and Dynamic Quality. As detailed in the technical paper, inference times are approximately:
For optimal performance, BF16 precision is recommended, though FP16 and FP32 are also supported. The model can run on consumer-grade hardware through various optimization techniques, including quantized inference (FP8 and INT8) in addition to the native BF16, FP16, and FP32 precisions.
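To see why precision choice matters for consumer GPUs, a back-of-the-envelope estimate of weight memory is useful. The sketch below multiplies a nominal 5 billion parameters by bytes per value for each precision; this covers transformer weights only (activations, the VAE, and the text encoder add further VRAM overhead), and the exact parameter count is an assumption based on the model's name.

```python
# Rough VRAM estimate for holding CogVideoX-5B's transformer weights
# at different precisions. Weights only -- activations, the 3D VAE,
# and the text encoder consume additional memory at inference time.

BYTES_PER_PARAM = {
    "FP32": 4,
    "BF16": 2,
    "FP16": 2,
    "FP8": 1,
    "INT8": 1,
}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Nominal 5B parameters (assumption: taken from the model name).
for precision in ("FP32", "BF16", "FP8"):
    print(f"{precision}: ~{weight_memory_gb(5e9, precision):.0f} GB")
# FP32: ~20 GB, BF16: ~10 GB, FP8: ~5 GB
```

The halving from FP32 to BF16, and again to FP8 or INT8, is what brings the weights within reach of 12-16 GB consumer cards, which is why BF16 is the recommended default and lower-precision quantization is offered as a fallback.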
The CogVideoX family includes several variants:
CogVideoX-2B (Released August 6, 2024):
CogVideoX-5B-I2V (Released September 19, 2024):
CogVideoX 1.5 (Released November 8, 2024):