The simplest way to self-host CogVideoX-2B: launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
CogVideoX-2B generates 6-second videos (720x480, 8 fps) from text descriptions. The 2B-parameter model pairs a 3D Causal VAE with an expert transformer that uses expert adaptive LayerNorm and 3D full attention. Trained on 35M captioned video clips, it serves as the entry point to the CogVideoX family of text-to-video models.
CogVideoX-2B is a large-scale text-to-video generation model released on August 6, 2024, and positioned as the entry-level option in the CogVideoX family. Built on a diffusion transformer architecture, it delivers a significant advance in text-to-video generation while remaining accessible, with lower hardware requirements than its larger siblings.
The model employs a 3D Causal VAE for video reconstruction and compression, combined with an expert transformer featuring expert adaptive LayerNorm for enhanced text-video alignment. As detailed in the research paper, the architecture incorporates 3D full attention to comprehensively model both temporal and spatial dimensions.
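To make the adaptive LayerNorm mechanism concrete, here is a minimal PyTorch sketch of the general idea, assuming one shared normalization with separate scale/shift "experts" for text and video tokens, conditioned on the diffusion timestep embedding. The class and shapes are hypothetical simplifications for illustration, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class ExpertAdaLayerNorm(nn.Module):
    """Sketch of expert adaptive LayerNorm: a shared LayerNorm with
    separate scale/shift modulation ('experts') per modality,
    conditioned on the timestep embedding. Hypothetical simplification."""

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.text_mod = nn.Linear(cond_dim, 2 * hidden_dim)   # text expert
        self.video_mod = nn.Linear(cond_dim, 2 * hidden_dim)  # video expert

    def forward(self, text_tokens, video_tokens, cond):
        # cond: (batch, cond_dim) timestep embedding
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        v_scale, v_shift = self.video_mod(cond).chunk(2, dim=-1)
        text_out = self.norm(text_tokens) * (1 + t_scale[:, None]) + t_shift[:, None]
        video_out = self.norm(video_tokens) * (1 + v_scale[:, None]) + v_shift[:, None]
        return text_out, video_out

# Example shapes: batch of 2, 226 text tokens, 1,000 video patches, dim 512
aln = ExpertAdaLayerNorm(hidden_dim=512, cond_dim=256)
t_out, v_out = aln(torch.randn(2, 226, 512), torch.randn(2, 1000, 512), torch.randn(2, 256))
```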
CogVideoX-2B generates videos at 720x480 resolution with 8 frames per second and supports videos up to 6 seconds in length. The model utilizes 3D sincos positional encoding and currently only supports English prompts with a maximum length of 226 tokens.
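Since prompts beyond the 226-token limit are truncated, it can be useful to check a prompt's length before generation. Below is a minimal sketch using the T5 tokenizer bundled with the checkpoint; the Hugging Face repo id and subfolder layout follow standard diffusers packaging and should be verified against the official repository.

```python
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 226  # CogVideoX-2B prompt limit

# T5 tokenizer shipped inside the diffusers checkpoint (layout assumed)
tokenizer = AutoTokenizer.from_pretrained("THUDM/CogVideoX-2b", subfolder="tokenizer")

prompt = "A golden retriever runs through a sunlit meadow, slow motion, 35mm film."
n_tokens = len(tokenizer(prompt).input_ids)
print(f"{n_tokens}/{MAX_PROMPT_TOKENS} tokens used")
if n_tokens > MAX_PROMPT_TOKENS:
    print("Prompt will be truncated; consider shortening it.")
```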
The training dataset comprised approximately 35 million single-shot video clips (averaging 6 seconds each) paired with text descriptions. Data preparation relied on a video captioning pipeline that leveraged pre-trained models and GPT-4 to produce detailed descriptions. An additional 2 billion images from the LAION-5B and COYO-700M datasets were incorporated into training.
The training methodology employed several innovative techniques (a sketch of one follows this list):
- Mixed-duration training with Frame Pack, which batches videos of different lengths together rather than padding everything to a fixed duration
- Resolution-progressive training, moving from lower to higher resolutions over the course of training
- Explicit Uniform Sampling, which partitions the diffusion timestep range across data-parallel ranks so the global batch covers timesteps more evenly
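To make the last technique concrete, here is a minimal sketch of the Explicit Uniform Sampling idea under a straightforward reading of the paper: each data-parallel rank draws timesteps only from its own sub-interval of the schedule. The helper below is hypothetical, not code from the official repository.

```python
import torch

def explicit_uniform_timesteps(batch_size: int, rank: int, world_size: int,
                               num_train_timesteps: int = 1000) -> torch.Tensor:
    """Sample diffusion timesteps for one data-parallel rank.

    The [0, num_train_timesteps) range is split into `world_size` equal
    sub-intervals; rank i draws only from sub-interval i, so the combined
    global batch covers the whole range more uniformly than i.i.d. sampling.
    """
    interval = num_train_timesteps // world_size
    low = rank * interval
    # Let the last rank absorb any remainder from the integer division.
    high = num_train_timesteps if rank == world_size - 1 else low + interval
    return torch.randint(low, high, (batch_size,))

# Example: 4 ranks, 8 timesteps each
for rank in range(4):
    print(rank, explicit_uniform_timesteps(8, rank, 4))
```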
CogVideoX-2B is notably flexible in terms of deployment options and hardware requirements. The model supports multiple precision formats, including FP16 (recommended), BF16, FP32, and INT8 quantization.
Using the diffusers library with optimizations enabled, the minimum GPU memory requirement is 4GB in FP16, and INT8 quantization reduces it further to 3.6GB. Inference takes approximately 90 seconds on a single A100 and 45 seconds on an H100 (50 steps, FP16/BF16).
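As a concrete illustration, a minimal generation script with the low-VRAM optimizations enabled might look like the following. The pipeline class and optimization calls are standard diffusers APIs for CogVideoX; the prompt and parameter values are illustrative.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Memory optimizations that bring VRAM usage down to the ~4GB range
pipe.enable_sequential_cpu_offload()  # stream weights to the GPU layer by layer
pipe.vae.enable_slicing()             # decode the latent video in slices
pipe.vae.enable_tiling()              # decode each frame in tiles

prompt = "A panda playing guitar in a bamboo forest, cinematic lighting."
video = pipe(
    prompt=prompt,
    num_frames=49,           # 6 seconds at 8 fps (48 frames + leading frame)
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```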
Within the CogVideoX family, several variants offer different capabilities:
- CogVideoX-2B: the entry-level text-to-video model described here
- CogVideoX-5B: a larger text-to-video model with higher visual quality and steeper hardware requirements
- CogVideoX-5B-I2V: an image-to-video variant that animates a provided first frame
CogVideoX-2B distinguishes itself as the most accessible option, capable of running on older GPUs like the GTX 1080 Ti, unlike its larger siblings.
CogVideoX-2B is released under the Apache 2.0 License, making it freely available for both research and commercial use. The model weights, including those for the 3D Causal VAE and video caption model, are publicly available through the official repository.