Note: CogVideoX 5B I2V weights are released under the CogVideoX License and cannot be used for commercial purposes. Please read the license to verify that your use case is permitted.
The simplest way to self-host CogVideoX 5B I2V. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
CogVideoX-5B I2V is a 5-billion-parameter model that generates videos from text or images. It creates 10-second videos at 16 fps with resolutions up to 1360 pixels per dimension. Trained on 35M video clips and 2B images, it uses a 3D VAE architecture with adaptive LayerNorm for frame consistency.
CogVideoX-5B I2V is a 5-billion parameter image-to-video generation model released on September 19, 2024. It represents a significant advancement in controlled video generation, allowing users to input both images and text prompts to generate videos. The model is built on a diffusion transformer architecture that incorporates several innovative components, as detailed in the technical paper.
The architecture leverages a 3D Variational Autoencoder (VAE) for efficient video compression, combined with an expert transformer featuring expert adaptive LayerNorm for improved text-video alignment. This design enables deep modality fusion between text and visual elements, resulting in more coherent and controllable video generation.
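The adaptive LayerNorm idea above can be sketched in a few lines: instead of fixed learned scale and shift parameters, the normalization is modulated by values predicted from a conditioning embedding (for example, the diffusion timestep and text features). The shapes and the linear projection below are illustrative assumptions for the sketch, not the model's actual implementation.

```python
import numpy as np

def adaptive_layernorm(x, cond, w_scale, w_shift, eps=1e-5):
    """Normalize x over its last axis, then modulate with a scale and
    shift predicted from a conditioning vector (illustrative sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    # Hypothetical projections from the conditioning embedding to the
    # hidden dimension; a real model would learn these weights.
    scale = cond @ w_scale   # (batch, hidden)
    shift = cond @ w_shift   # (batch, hidden)
    return x_norm * (1 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))        # (batch, tokens, hidden)
cond = rng.normal(size=(2, 16))       # conditioning embedding
w_scale = rng.normal(size=(16, 8)) * 0.01
w_shift = rng.normal(size=(16, 8)) * 0.01
out = adaptive_layernorm(x, cond, w_scale, w_shift)
print(out.shape)  # (2, 4, 8)
```

Because the scale and shift depend on the conditioning signal, every transformer block can adapt its normalization to the current timestep and prompt, which is what enables the deep text-video fusion described above.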
CogVideoX-5B I2V supports video generation at resolutions from 768 pixels (minimum) to 1360 pixels (maximum) in either dimension, with the requirement that each dimension be a multiple of 16. The model generates videos at 16 frames per second and supports longer outputs than previous CogVideoX models.
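The resolution constraints above (each dimension a multiple of 16, between 768 and 1360 pixels) can be checked with a small helper; the function names here are illustrative, not part of any official API.

```python
def valid_resolution(width: int, height: int) -> bool:
    """Check CogVideoX-5B I2V resolution constraints:
    each dimension must be a multiple of 16 and lie in [768, 1360]."""
    return all(768 <= d <= 1360 and d % 16 == 0 for d in (width, height))

def snap_to_valid(d: int) -> int:
    """Clamp a dimension into [768, 1360] and round to a multiple of 16."""
    d = max(768, min(1360, d))
    return round(d / 16) * 16

print(valid_resolution(1360, 768))   # True
print(valid_resolution(1280, 720))   # False: 720 is below the 768 minimum
print(snap_to_valid(720))            # 768
```

A helper like `snap_to_valid` is useful when resizing an arbitrary input image before passing it to the model, since off-by-a-few-pixels dimensions would otherwise be rejected.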
The model demonstrates superior performance across multiple automated metrics and human evaluations, including Human Action, Scene, Dynamic Degree, Multiple Objects, Appearance Style, and Dynamic Quality measures. When compared to other models in the CogVideoX family, the 5B I2V variant offers enhanced controllability through its image input capability, distinguishing it from text-only models like CogVideoX-2B.
The model supports multiple precision options for inference, including BF16 and INT8. Memory requirements vary with the chosen precision and optimization techniques: using the diffusers library, BF16 precision requires a minimum of 5 GB of GPU memory, while INT8 quantization reduces this to 4.4 GB. The model supports English prompts with a maximum length of 226 tokens.
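A back-of-envelope calculation shows why those minimums depend on optimization techniques and not just precision: the raw BF16 weights of a 5-billion-parameter model alone occupy roughly 10 GB, so the quoted 5 GB minimum is only reachable with tricks such as CPU offloading, which keep part of the model out of VRAM at any given moment. The figures below are approximate weight sizes only, ignoring activations and the VAE.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

N = 5e9  # CogVideoX-5B: roughly 5 billion parameters
for name, nbytes in [("FP32", 4), ("BF16", 2), ("INT8", 1)]:
    print(f"{name}: ~{weight_memory_gb(N, nbytes):.0f} GB of weights")
# FP32: ~20 GB, BF16: ~10 GB, INT8: ~5 GB
```

This also makes clear why INT8 quantization halves the weight footprint relative to BF16, even though the end-to-end GPU requirement (4.4 GB vs. 5 GB) shrinks by less, since non-weight memory is unaffected.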
Within the CogVideoX family, several variants exist, including the text-to-video CogVideoX-2B and CogVideoX-5B models. The 5B I2V variant distinguishes itself through its ability to accept image inputs as backgrounds, providing greater control over the generated video content than the text-only variants.
The model's training leveraged approximately 35 million single-shot video clips with text descriptions, complemented by 2 billion filtered images from LAION-5B and COYO-700M datasets. A sophisticated video captioning pipeline was implemented during training, utilizing multiple datasets and advanced language models to generate accurate video descriptions.