Note: CogVideoX 1.5 5B weights are released under the CogVideoX License and cannot be used for commercial purposes. Please read the license to verify whether your use case is permitted.
The simplest way to self-host CogVideoX 1.5 5B. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
CogVideoX 1.5 5B generates videos up to 10 seconds long at 16 FPS, with resolutions up to 1360x768. It uses a 3D Variational Autoencoder and an expert transformer architecture, trained on 35M video clips and 2B images, and is notable for its temporal consistency and its support for both text-to-video and image-to-video generation.
CogVideoX 1.5 5B, released on November 8th, 2024, represents a significant advancement in text-to-video and image-to-video generation capabilities. This open-source model builds upon its predecessors in the CogVideoX family, offering substantial improvements in video resolution, duration, and generation quality. The model was developed by THUDM and is detailed in their research paper.
The model employs a diffusion transformer architecture with several innovative components. At its core is a 3D Variational Autoencoder (VAE) that efficiently compresses videos both spatially and temporally. The architecture incorporates an expert transformer with expert adaptive LayerNorm, which enhances text-video alignment through deep modality fusion. Position encoding is handled through 3D rotary position embeddings (3d_rope_pos_embed), while 3D full attention jointly models the spatial and temporal dimensions of the video data.
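As a rough illustration of the 3D rotary position-embedding idea (not the model's actual implementation, which lives in the released THUDM/diffusers code), the sketch below builds per-axis rotary tables for a (time, height, width) latent grid by splitting the attention head dimension across the three axes. The channel split and base frequency are assumptions chosen for illustration:

```python
import torch

def axis_rope(pos: torch.Tensor, dim: int, theta: float = 10000.0):
    # Standard 1D rotary tables for one axis: returns cos/sin of shape (len(pos), dim // 2).
    freqs = 1.0 / torch.pow(theta, torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]
    return angles.cos(), angles.sin()

def rope_3d_tables(t: int, h: int, w: int, head_dim: int):
    # Hypothetical split of the head dimension: a quarter for time, the rest split over height/width.
    dt = head_dim // 4
    dh = dw = (head_dim - dt) // 2
    assert dt + dh + dw == head_dim, "head_dim must split evenly across the three axes"
    cos_t, sin_t = axis_rope(torch.arange(t), dt)
    cos_h, sin_h = axis_rope(torch.arange(h), dh)
    cos_w, sin_w = axis_rope(torch.arange(w), dw)
    # Broadcast each axis table over the other two axes, then concatenate along channels
    # to get one table per latent position: (t * h * w, head_dim // 2).
    cos = torch.cat([
        cos_t[:, None, None, :].expand(t, h, w, -1),
        cos_h[None, :, None, :].expand(t, h, w, -1),
        cos_w[None, None, :, :].expand(t, h, w, -1),
    ], dim=-1).reshape(t * h * w, head_dim // 2)
    sin = torch.cat([
        sin_t[:, None, None, :].expand(t, h, w, -1),
        sin_h[None, :, None, :].expand(t, h, w, -1),
        sin_w[None, None, :, :].expand(t, h, w, -1),
    ], dim=-1).reshape(t * h * w, head_dim // 2)
    return cos, sin

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate query/key channels pairwise; x has shape (..., seq, head_dim).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```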
A notable technical feature is the progressive training pipeline that implements multi-resolution frame packing techniques. This allows the model to handle varying video lengths and resolutions effectively. The architecture also includes Explicit Uniform Sampling to stabilize the training process, as detailed in the technical documentation.
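The Explicit Uniform Sampling idea can be sketched as follows: rather than every data-parallel rank drawing diffusion timesteps independently from the full range, each rank draws from its own sub-interval, so each global batch covers the timestep range more evenly. The function below is an illustrative sketch under that reading, not the authors' code; the names and defaults are assumptions:

```python
import torch

def explicit_uniform_timesteps(rank: int, world_size: int, batch_size: int,
                               num_train_timesteps: int = 1000,
                               generator: torch.Generator | None = None) -> torch.Tensor:
    # Partition [0, num_train_timesteps) into one contiguous interval per rank,
    # and sample this rank's timesteps only from its own interval.
    interval = num_train_timesteps / world_size
    low = int(rank * interval)
    high = int((rank + 1) * interval)
    return torch.randint(low, high, (batch_size,), generator=generator)
```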
The model's training utilized approximately 35 million single-shot video clips, averaging 6 seconds in length, combined with 2 billion images from LAION-5B and COYO-700M datasets. The training process incorporated a sophisticated data processing pipeline that included rigorous video filtering based on various quality criteria. A notable aspect of the training was the use of CogVLM2-Caption, a specialized caption model that converts video data into detailed text descriptions.
The data preprocessing involved removing low-quality videos with editing artifacts, poor motion connectivity, lecture-style content, text-heavy frames, and noisy screenshots. This careful curation process contributed significantly to the model's ability to generate high-quality, semantically aligned outputs.
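Conceptually, this filtering stage amounts to dropping any clip whose metadata flags one of the problem categories above. The sketch below is purely illustrative; the flag names and threshold are hypothetical and not taken from the released pipeline:

```python
from dataclasses import dataclass

@dataclass
class ClipMetadata:
    # Hypothetical per-clip quality flags produced by upstream classifiers.
    has_editing_artifacts: bool
    poor_motion_connectivity: bool
    is_lecture_style: bool
    text_coverage: float  # fraction of frames dominated by overlaid text
    is_noisy_screenshot: bool

def keep_clip(meta: ClipMetadata, max_text_coverage: float = 0.1) -> bool:
    # Return True only if the clip passes every quality filter described above.
    return not (
        meta.has_editing_artifacts
        or meta.poor_motion_connectivity
        or meta.is_lecture_style
        or meta.text_coverage > max_text_coverage
        or meta.is_noisy_screenshot
    )
```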
CogVideoX 1.5 5B can generate videos at resolutions up to 1360 x 768 pixels for text-to-video generation, with variable resolution support for image-to-video generation. Videos can be up to 10 seconds long at 16 frames per second, featuring significant motion and temporal consistency. The model supports English prompts with a maximum length of 224 tokens.
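A minimal text-to-video example with the Hugging Face diffusers library is sketched below. It assumes the repository id THUDM/CogVideoX1.5-5B and uses illustrative sampling settings; frame count, step count, and guidance scale should be tuned to your hardware and use case:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the pipeline in BF16, the recommended inference precision.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "A golden retriever running through a sunlit meadow, cinematic, shallow depth of field."

# 81 frames at 16 FPS is roughly a 5-second clip; longer clips need more frames and more VRAM.
video = pipe(
    prompt=prompt,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=16)
```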
Performance benchmarks show impressive results across multiple automated metrics, including Human Action, Scene, Dynamic Degree, Multiple Objects, Appearance Style, Dynamic Quality, and GPT4o-MTScore. The model family includes both 5B and 2B parameter variants, with the 5B version demonstrating superior performance in human evaluations for Sensory Quality, Instruction Following, Physics Simulation, and Cover Quality.
For inference, the model supports multiple precision options including BF16 (recommended), FP16, FP32, FP8*, and INT8, though INT4 is not supported. Minimum GPU memory requirements vary by configuration:
The model includes several optimization techniques through the diffusers library:
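As an illustration of the kind of options available, the snippet below applies the standard diffusers memory-saving calls (CPU offloading and VAE slicing/tiling) to an already-loaded pipeline; whether you need them depends on your GPU and target resolution:

```python
# Assuming `pipe` is the CogVideoXPipeline from the example above.
# When VRAM is limited, call this instead of pipe.to("cuda"): submodules are kept on
# the CPU and moved to the GPU only when needed, trading speed for a smaller footprint.
pipe.enable_sequential_cpu_offload()

# Decode video latents in slices/tiles to reduce the VAE's peak memory use.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```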