Model Report
THUDM / CogVideoX 2B
CogVideoX-2B is an open-source text-to-video diffusion model developed by THUDM that generates videos up to 720×480 resolution and six seconds in length. The model employs a 3D causal variational autoencoder and Expert Transformer architecture with 3D rotary position embeddings for temporal coherence. Trained on 35 million video clips and 2 billion images using progressive training techniques, it supports INT8 quantization and is released under the Apache 2.0 license.
CogVideoX-2B is a large-scale text-to-video generative diffusion model developed by THUDM and Zhipu AI. Part of the broader CogVideoX family, this open-source model is designed to synthesize coherent, high-resolution, and semantically consistent video clips from textual descriptions or images. Addressing previous limitations of video generation, such as short duration and restricted motion, CogVideoX-2B introduces architectural and training innovations aimed at enabling longer and more visually consistent video outputs. The model was officially released on August 6, 2024, along with its associated 3D causal variational autoencoder (VAE) and detailed technical documentation (CogVideoX paper).
Web demo interface of the CogVideoX model, showing input settings and a generated video output. Prompt: 'A little girl is riding a bicycle at high speed. Focused, detailed, realistic.'
Model Architecture
CogVideoX-2B is based on a diffusion Transformer architecture, leveraging several core components to efficiently generate video content from text prompts. At the heart of the model is a 3D causal VAE, which compresses and reconstructs video data across spatial and temporal dimensions, substantially reducing the input sequence length and computational demands. The VAE employs three-dimensional convolutions, yielding higher compression ratios and improved frame continuity, and uses temporally causal convolutions to ensure that future information does not impact current frame predictions (CogVideoX technical report).
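As an illustration of the temporal causality in the VAE, the following PyTorch sketch pads the time axis only on the past side, so each output frame depends solely on the current and earlier frames. Module and parameter names are hypothetical, not the released CogVideoX implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Minimal sketch of a temporally causal 3D convolution.

    Spatial dims are padded symmetrically; the time dim is padded only on
    the 'past' side, so frame t never sees frames later than t. Illustrative
    only, not the released CogVideoX code.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel: tuple = (3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                              # pad only before the first frame
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        # F.pad order for a 5D tensor: (W_left, W_right, H_left, H_right, T_left, T_right)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# Example: 8 latent frames in, 8 frames out, with no future-frame leakage.
video = torch.randn(1, 4, 8, 60, 90)
out = CausalConv3d(4, 4)(video)
print(out.shape)  # torch.Size([1, 4, 8, 60, 90])
```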
Text-to-video alignment is further supported through the use of an Expert Transformer architecture, featuring Expert Adaptive LayerNorm (AdaLN) modules that facilitate deep fusion of language and vision modalities. Video features are encoded using a Patchify approach, which enables joint image and video training by reformatting image and video inputs as sequences of frame-based latent codes. For temporal and spatial understanding, the model uses a 3D rotary position embedding (3D-RoPE), separately encoding x, y, and t coordinates to maintain long-range coherence in generated scenes.
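The sketch below shows one way 3D-RoPE can be realized: each attention head's channels are split across the t, y, and x coordinates, and standard rotary embeddings are applied per axis. The channel split and helper functions are assumptions for illustration, not the exact CogVideoX configuration.

```python
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard 1D rotary frequencies for integer positions `pos`."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]          # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_3d_rope(q, t_idx, y_idx, x_idx, dims=(16, 24, 24)):
    """Apply rotary embeddings separately to channel groups for t, y, x.

    q:     (N, D) query/key vectors with D == sum(dims)
    *_idx: (N,) integer coordinates of each token in the latent video grid
    dims:  illustrative per-axis channel split (not the released config)
    """
    outs, start = [], 0
    for idx, d in zip((t_idx, y_idx, x_idx), dims):
        cos, sin = rope_1d(idx, d)
        cos = torch.cat([cos, cos], dim=-1)                  # (N, d)
        sin = torch.cat([sin, sin], dim=-1)
        part = q[:, start:start + d]
        outs.append(part * cos + rotate_half(part) * sin)
        start += d
    return torch.cat(outs, dim=-1)

# Example: tokens from a 4x8x8 (t, y, x) latent grid, head dimension 64.
t, y, x = torch.meshgrid(torch.arange(4), torch.arange(8), torch.arange(8), indexing="ij")
q = torch.randn(4 * 8 * 8, 64)
q_rot = apply_3d_rope(q, t.flatten(), y.flatten(), x.flatten())
print(q_rot.shape)  # torch.Size([256, 64])
```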
A summary collage of prompt-to-video sequences generated by the CogVideoX model, each row associated with a textual prompt and a distinct visual scene.
Training Data and Methodology
CogVideoX-2B is trained on a broad and diverse visual corpus, combining approximately 35 million high-quality, single-shot video clips, each paired with descriptive text, and over 2 billion filtered images from large-scale datasets such as LAION-5B and COYO-700M. Videos are screened with a filtering pipeline built on multiple criteria, such as motion continuity and image aesthetics, to maximize output quality (data details).
In handling mixed data types and durations, CogVideoX-2B treats static images as single-frame videos, using mixed-duration training and the Multi-Resolution Frame Pack technique to combine videos of varying length and resolution within a single batch. This approach improves dataset utilization and model generalization.
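A minimal sketch of the idea follows, assuming images are simply treated as one-frame videos and variable-length clips are padded and masked within a batch; the paper's Multi-Resolution Frame Pack additionally mixes resolutions, which is not reproduced here.

```python
import torch

def as_video(x: torch.Tensor) -> torch.Tensor:
    """Treat an image (C, H, W) as a single-frame video (C, 1, H, W)."""
    return x.unsqueeze(1) if x.dim() == 3 else x

def pack_batch(clips):
    """Pad variable-length clips (C, T, H, W) to a common length and return a
    frame mask so padded frames can be ignored downstream. Illustrative only:
    all clips here must share the same spatial size."""
    clips = [as_video(c) for c in clips]
    t_max = max(c.shape[1] for c in clips)
    batch, mask = [], []
    for c in clips:
        pad_t = t_max - c.shape[1]
        batch.append(torch.nn.functional.pad(c, (0, 0, 0, 0, 0, pad_t)))
        mask.append(torch.cat([torch.ones(c.shape[1]), torch.zeros(pad_t)]))
    return torch.stack(batch), torch.stack(mask).bool()

# Example: one image and one 13-frame latent clip packed into the same batch.
image = torch.randn(16, 60, 90)          # single frame
clip = torch.randn(16, 13, 60, 90)       # short video
videos, frame_mask = pack_batch([image, clip])
print(videos.shape, frame_mask.shape)    # (2, 16, 13, 60, 90) (2, 13)
```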
Progressive training is applied by first exposing the model to low-resolution, short-duration content to capture basic semantics and gradually increasing both resolution and video length throughout training. Explicit uniform sampling of diffusion timesteps ensures stable training and consistent loss curves. To improve video-text alignment, a multi-stage captioning pipeline generates dense descriptions for video data by combining clip-level and frame-level captions, summarizing with large language models such as GPT-4 and Llama 2.
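Explicit uniform sampling can be sketched as each data-parallel rank drawing timesteps only from its own sub-interval of the diffusion schedule, so that every global batch covers the full timestep range more evenly than independent uniform sampling. This is a simplified illustration, not the reference implementation.

```python
import torch

def explicit_uniform_timesteps(rank: int, world_size: int,
                               batch_size: int, num_train_steps: int = 1000) -> torch.Tensor:
    """Each rank samples only from its assigned sub-interval of [0, num_train_steps)."""
    interval = num_train_steps / world_size
    low = int(rank * interval)
    high = int((rank + 1) * interval)
    return torch.randint(low, high, (batch_size,))

# Example with 4 data-parallel ranks: rank 1 samples only from [250, 500).
for r in range(4):
    print(r, explicit_uniform_timesteps(rank=r, world_size=4, batch_size=4))
```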
Capabilities and Features
CogVideoX-2B can generate videos at resolutions up to 720×480 pixels, up to six seconds in length at eight frames per second, with output encompassing a wide range of scene dynamics and semantic complexity (CogVideoX-2B Hugging Face page). The model maintains temporal and spatial coherence in the resulting videos and supports diverse aspect ratios through careful handling during progressive training.
Features include prompt optimization, where input prompts are refined by large language models such as GLM-4 or GPT-4 for higher output quality. The model accommodates INT8 and FP8 quantized inference, thereby reducing memory requirements and allowing operation on lower-resource hardware. Additionally, the architecture is designed for scalability, supporting future enhancements with larger model sizes and training datasets.
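With the Hugging Face diffusers integration, a typical inference call looks roughly like the following; the sampling parameters shown are illustrative, and the model card should be consulted for the recommended settings.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 2B checkpoint; fp16 keeps VRAM usage modest on consumer GPUs.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()   # trade some speed for lower peak VRAM

prompt = ("A little girl is riding a bicycle at high speed. "
          "Focused, detailed, realistic.")

video = pipe(
    prompt=prompt,
    num_frames=49,                # roughly six seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```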
Sample output from CogVideoX-2B, demonstrating text-to-video generation from a single text prompt. [Source]
Performance Evaluation
CogVideoX-2B's performance is assessed using automated metrics from the VBench benchmark, measuring aspects such as Human Action (96.6), Scene (55.35), Dynamic Degree (66.39), Multiple Objects (57.68), Appearance Style (24.37), Dynamic Quality (57.7), and GPT4o-MT Score (3.09), as reported in the technical report. The 3D VAE deployed within CogVideoX-2B achieves a peak signal-to-noise ratio (PSNR) of 29.1 and a frame flickering score of 85.5 on standard benchmarks.
The use of Explicit Uniform Sampling during diffusion training is reported to yield more consistent loss reduction and better model convergence. Human evaluation studies involving larger CogVideoX models assess perceptual criteria, such as sensory quality and prompt adherence, in comparison to other contemporary generative models.
A 32-frame sequential output from CogVideoX models showing video continuity. Prompt: 'A man is running in the sea.'
Limitations
CogVideoX-2B is primarily trained and evaluated on English prompts, with multi-language support relying on external language models for translation. Memory optimization features, such as model CPU offload, reduce VRAM usage but can impact inference speed. Running the model with quantized (INT8) inference may also result in slight slowdowns, though with minimal degradation in visual quality. Video outputs are subject to resolution and duration ceilings, with CogVideoX-2B supporting up to 720×480 resolution and six seconds per clip. Semantic fidelity may decrease slightly when the model is fine-tuned on specialized, visually focused training subsets, and aggressive VAE compression ratios can challenge model convergence (full discussion).
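As a sketch, the memory-saving options mentioned above map onto switches such as the following in the diffusers pipeline; actual VRAM savings and speed penalties depend on the hardware and settings used.

```python
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Options that trade inference speed for lower peak VRAM usage:
pipe.enable_sequential_cpu_offload()  # offload submodules to CPU between uses (slowest, lowest VRAM)
pipe.vae.enable_slicing()             # decode latents one sample at a time
pipe.vae.enable_tiling()              # decode the latent video in spatial tiles
```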
Licensing
The CogVideoX-2B model, along with its codebase and associated modules, is released under the Apache 2.0 License. This license permits broad use and modification, promoting scientific research and collaboration.