Note: CogVideoX 5B I2V weights are released under the CogVideoX License and cannot be used for commercial purposes. Please read the license to verify that your use case is permitted.
The simplest way to self-host CogVideoX 5B I2V. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
CogVideoX-5B I2V is a 5-billion-parameter model that generates videos from text or images. It creates 10-second videos at 16 fps with resolutions up to 1360 pixels per dimension. Trained on 35M video clips and 2B images, it uses a 3D VAE architecture with adaptive LayerNorm for frame consistency.
CogVideoX-5B I2V is a 5-billion parameter image-to-video generation model released on September 19, 2024. It represents a significant advancement in controlled video generation, allowing users to input both images and text prompts to generate videos. The model is built on a diffusion transformer architecture that incorporates several innovative components, as detailed in the technical paper.
The architecture leverages a 3D Variational Autoencoder (VAE) for efficient video compression, combined with an expert transformer featuring expert adaptive LayerNorm for improved text-video alignment. This design enables deep modality fusion between text and visual elements, resulting in more coherent and controllable video generation.
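The adaptive LayerNorm idea above can be sketched in a few lines: instead of fixed learned scale and shift parameters, the normalization is modulated by values predicted from a conditioning embedding (for example, the diffusion timestep and text features). The shapes and the linear projection below are illustrative assumptions for the sketch, not the model's actual implementation.

```python
import numpy as np

def adaptive_layernorm(x, cond, w_scale, w_shift, eps=1e-5):
    """Normalize x over its last axis, then modulate with a scale and
    shift predicted from a conditioning vector (illustrative sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    # Hypothetical projections from the conditioning embedding to the
    # hidden dimension; a real model would learn these weights.
    scale = cond @ w_scale   # (batch, hidden)
    shift = cond @ w_shift   # (batch, hidden)
    return x_norm * (1 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))        # (batch, tokens, hidden)
cond = rng.normal(size=(2, 16))       # conditioning embedding
w_scale = rng.normal(size=(16, 8)) * 0.01
w_shift = rng.normal(size=(16, 8)) * 0.01
out = adaptive_layernorm(x, cond, w_scale, w_shift)
print(out.shape)  # (2, 4, 8)
```

Because the scale and shift depend on the conditioning signal, every transformer block can adapt its normalization to the current timestep and prompt, which is what enables the deep text-video fusion described above.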
CogVideoX-5B I2V supports video generation at resolutions from 768 pixels (minimum) to 1360 pixels (maximum) in either dimension, with the requirement that each dimension be a multiple of 16. The model generates videos at 16 frames per second and supports longer outputs than previous CogVideoX models.
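The resolution constraints above (each dimension a multiple of 16, between 768 and 1360 pixels) can be checked with a small helper; the function names here are illustrative, not part of any official API.

```python
def valid_resolution(width: int, height: int) -> bool:
    """Check CogVideoX-5B I2V resolution constraints:
    each dimension must be a multiple of 16 and lie in [768, 1360]."""
    return all(768 <= d <= 1360 and d % 16 == 0 for d in (width, height))

def snap_to_valid(d: int) -> int:
    """Clamp a dimension into [768, 1360] and round to a multiple of 16."""
    d = max(768, min(1360, d))
    return round(d / 16) * 16

print(valid_resolution(1360, 768))   # True
print(valid_resolution(1280, 720))   # False: 720 is below the 768 minimum
print(snap_to_valid(720))            # 768
```

A helper like `snap_to_valid` is useful when resizing an arbitrary input image before passing it to the model, since off-by-a-few-pixels dimensions would otherwise be rejected.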
The model demonstrates superior performance across multiple automated metrics and human evaluations, including Human Action, Scene, Dynamic Degree, Multiple Objects, Appearance Style, and Dynamic Quality measures. When compared to other models in the CogVideoX family, the 5B I2V variant offers enhanced controllability through its image input capability, distinguishing it from text-only models like CogVideoX-2B.
The model supports multiple precision options for inference, including BF16 and INT8. Memory requirements vary with the chosen precision and optimization techniques: using the diffusers library, BF16 precision requires a minimum of 5 GB of GPU memory, while INT8 quantization reduces this to 4.4 GB. The model supports English prompts with a maximum length of 226 tokens.
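A back-of-envelope calculation shows why those minimums depend on optimization techniques and not just precision: the raw BF16 weights of a 5-billion-parameter model alone occupy roughly 10 GB, so the quoted 5 GB minimum is only reachable with tricks such as CPU offloading, which keep part of the model out of VRAM at any given moment. The figures below are approximate weight sizes only, ignoring activations and the VAE.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

N = 5e9  # CogVideoX-5B: roughly 5 billion parameters
for name, nbytes in [("FP32", 4), ("BF16", 2), ("INT8", 1)]:
    print(f"{name}: ~{weight_memory_gb(N, nbytes):.0f} GB of weights")
# FP32: ~20 GB, BF16: ~10 GB, INT8: ~5 GB
```

This also makes clear why INT8 quantization halves the weight footprint relative to BF16, even though the end-to-end GPU requirement (4.4 GB vs. 5 GB) shrinks by less, since non-weight memory is unaffected.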
Within the CogVideoX family, several variants exist, including the text-to-video CogVideoX-2B and CogVideoX-5B models. The 5B I2V variant distinguishes itself through its ability to accept image inputs as backgrounds, providing greater control over the generated video content than the text-only variants.
The model's training leveraged approximately 35 million single-shot video clips with text descriptions, complemented by 2 billion filtered images from LAION-5B and COYO-700M datasets. A sophisticated video captioning pipeline was implemented during training, utilizing multiple datasets and advanced language models to generate accurate video descriptions.