Note: the CogVideoX 1.5 5B I2V weights are released under the CogVideoX License and cannot be used for commercial purposes. Please read the license to verify that your use case is permitted.
The simplest way to self-host CogVideoX 1.5 5B I2V. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
CogVideoX 1.5 5B I2V transforms static images into 5-10 second videos at 16 FPS, supporting resolutions up to 1360 pixels on the longer side. Built on a diffusion transformer architecture with a 3D VAE, it was trained on roughly 35M video clips and 2B images. It is notable for its multi-resolution frame packing and progressive training approach.
CogVideoX 1.5 5B I2V is a state-of-the-art image-to-video generation model released in November 2024 as part of the CogVideoX model family. Built on a diffusion transformer architecture, it represents a significant advancement in video generation capabilities, as detailed in the technical paper.
The model utilizes a sophisticated 3D Variational Autoencoder (VAE) for improved video compression and fidelity, combined with an expert transformer featuring expert adaptive LayerNorm for enhanced text-video alignment. This architecture enables deep fusion between text and video modalities while preventing common issues like flickering in generated videos.
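To make the conditioning mechanism concrete, here is a minimal, pure-Python sketch of the adaptive LayerNorm idea: the input is normalized as usual, then modulated by a scale and shift predicted from a conditioning vector (such as the timestep or text embedding). The function, weight shapes, and names are illustrative assumptions, not the model's actual implementation, and the expert-specific details are omitted:

```python
import math

def adaptive_layernorm(x, cond, w_scale, w_shift, eps=1e-5):
    """Sketch of adaptive LayerNorm for a single feature vector x.

    Normalizes x over its features, then applies a scale and shift
    that are linear functions of the conditioning vector `cond`.
    `w_scale` and `w_shift` are len(cond) x len(x) weight matrices
    (illustrative placeholders, not the model's parameters).
    """
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    xhat = [(v - mu) / math.sqrt(var + eps) for v in x]
    # scale = cond @ w_scale, shift = cond @ w_shift
    scale = [sum(c * w for c, w in zip(cond, col)) for col in zip(*w_scale)]
    shift = [sum(c * w for c, w in zip(cond, col)) for col in zip(*w_shift)]
    # (1 + scale) keeps the layer close to identity when scale is small
    return [h * (1 + s) + b for h, s, b in zip(xhat, scale, shift)]
```

With zero conditioning weights this reduces to plain LayerNorm; in the model, separate ("expert") modulation parameters are used for the text and video streams.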
A key innovation in the model's design is its progressive training approach combined with a multi-resolution frame packing technique, which together enable the generation of coherent, long-duration videos with significant motion. The model also incorporates Explicit Uniform Sampling, which stabilizes the training loss curve and accelerates convergence.
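A sketch of the Explicit Uniform Sampling idea, assuming the commonly described formulation in which the diffusion timestep range is partitioned evenly across data-parallel ranks so that each step covers the range uniformly, rather than every rank sampling independently (function name and interface are our own):

```python
import random

def explicit_uniform_timestep(rank, world_size, num_steps=1000):
    """Sample a diffusion timestep for one data-parallel rank.

    The [0, num_steps) range is split into `world_size` contiguous
    intervals; rank i samples only from interval i. Across all ranks
    each training step then sees timesteps spread evenly over the
    whole range, which reduces variance in the loss estimate.
    """
    lo = rank * num_steps // world_size
    hi = (rank + 1) * num_steps // world_size
    return random.randrange(lo, hi)
```

With naive independent sampling, a batch can by chance concentrate in one part of the noise schedule; the partitioned scheme rules that out by construction.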
The model was trained on an extensive dataset of approximately 35 million single-shot video clips (averaging 6 seconds each) with corresponding text descriptions, supplemented by 2 billion high-quality images filtered from the LAION-5B and COYO-700M datasets. Training relied on a sophisticated data-processing pipeline, including quality filtering of the video clips and dense video captioning.
CogVideoX 1.5 5B I2V can generate videos at flexible resolutions: the smaller dimension must be 768 pixels and the larger dimension between 768 and 1360 pixels, always divisible by 16. It supports video lengths of 5 or 10 seconds at 16 frames per second, with English prompts up to 224 tokens.
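As a rough illustration, the resolution constraints above can be encoded in a small validation helper; the function name and interface are our own, not part of any official API:

```python
def valid_resolution(width: int, height: int) -> bool:
    """Check (width, height) against the documented constraints:
    the smaller side must equal 768 pixels, and the larger side must
    lie between 768 and 1360 pixels and be divisible by 16."""
    lo, hi = sorted((width, height))
    return lo == 768 and 768 <= hi <= 1360 and hi % 16 == 0

valid_resolution(1360, 768)  # True: maximum landscape size
valid_resolution(768, 768)   # True: square output
valid_resolution(1024, 576)  # False: smaller side is not 768
```

Checking sizes up front like this avoids wasting a long generation run on an input the model cannot handle.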
The model offers multiple inference precision options, with BF16 recommended for best results.
Memory requirements and performance vary significantly with the chosen precision, resolution, video length, and offloading configuration.
Quantization tools such as TorchAO and Optimum Quanto can reduce the memory footprint, enabling inference on lower-VRAM GPUs.
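To make the benefit concrete, here is a back-of-envelope estimate of the memory needed just to hold the transformer's weights at different precisions. It ignores activations, the text encoder, and the VAE, and assumes a round 5 billion parameters, so treat the numbers as rough lower bounds:

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory (GiB) needed to hold model weights alone."""
    return num_params * bytes_per_param / 1024**3

PARAMS = 5e9  # assumed ~5 billion parameters in the transformer

for name, nbytes in [("FP32", 4), ("BF16", 2), ("INT8", 1)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS, nbytes):.1f} GiB")
```

By this estimate, INT8 quantization roughly halves the weight footprint relative to BF16, which is what brings inference within reach of consumer GPUs.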
The CogVideoX family includes several variants with different capabilities, spanning text-to-video and image-to-video generation at the 2B and 5B parameter scales.
The 1.5 series represents a significant improvement over earlier versions, offering longer duration (10 seconds vs 6 seconds) and higher resolution capabilities (up to 1360x768 vs 720x480).