Note: Stable Video Diffusion weights are released under the Stability AI Non-Commercial Research Community License and cannot be used for commercial purposes. Please read the license to confirm whether your use case is permitted.
The simplest way to self-host Stable Video Diffusion. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
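For the local-inference route, a minimal sketch using the Hugging Face diffusers library might look like the following. It assumes the fp16 weights from the stabilityai/stable-video-diffusion-img2vid-xt repository, a local input.png conditioning image, and enough VRAM to run with CPU offloading enabled; treat it as a starting point rather than a reference implementation.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the XT (25-frame) variant; swap in "stabilityai/stable-video-diffusion-img2vid"
# for the 14-frame base model. fp16 weights keep VRAM usage manageable.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # offload idle submodules to CPU to reduce peak VRAM

# Conditioning image; SVD expects roughly 1024x576 (width x height).
image = load_image("input.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```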
Stable Video Diffusion converts still images into short video clips (up to 4 seconds) by extending Stable Diffusion 2.1 with temporal layers. Trained on 152M curated video clips, it offers two variants: standard (14 frames) and XT (25 frames). Notable for smooth motion generation and built-in frame interpolation capabilities.
Stable Video Diffusion (SVD) represents a significant advancement in AI-generated video technology, developed by Stability AI as their first foundation model for generative video. Released in November 2023, it builds on the company's Stable Diffusion image models to generate short, high-quality video clips from still images, with text-to-video generation also explored in the accompanying research.
The model employs a latent diffusion architecture, modified from Stable Diffusion 2.1's UNet design with additional temporal convolution and attention layers. The training process involves three crucial stages: image pretraining (starting from a 2D text-to-image model), video pretraining on a large curated video dataset at lower resolution, and high-quality video finetuning on a smaller set of high-resolution clips.
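To make the architectural idea concrete, the sketch below shows one way a temporal attention layer can be interleaved with frame-wise spatial layers: the latent frames of a clip are regrouped so that attention runs along the time axis at each spatial location. This is a simplified, hypothetical illustration (the class name, shapes, and residual wiring are assumptions), not SVD's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical sketch of a temporal mixing layer: at every spatial
    location, the frames of a clip attend to one another along the time
    axis. This illustrates the idea only; it is not SVD's actual code."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width) -- the layout spatial
        # layers use when they process every frame as an independent image.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Regroup so the sequence dimension is time: (batch * h * w, frames, channels).
        seq = (x.view(b, num_frames, c, h, w)
                .permute(0, 3, 4, 1, 2)
                .reshape(b * h * w, num_frames, c))
        normed = self.norm(seq)
        attended, _ = self.attn(normed, normed, normed)
        seq = seq + attended  # residual connection around the temporal layer
        # Restore the frame-wise (batch * num_frames, channels, height, width) layout.
        return (seq.reshape(b, h, w, num_frames, c)
                   .permute(0, 3, 4, 1, 2)
                   .reshape(bt, c, h, w))

# Example: 2 clips of 14 frames, 64 latent channels, 8x8 latent resolution.
layer = TemporalAttention(channels=64, num_heads=8)
out = layer(torch.randn(2 * 14, 64, 8, 8), num_frames=14)
```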
Data curation played a vital role in the model's development, involving sophisticated processes such as cut detection, motion annotation through optical flow analysis, caption generation (using CoCa, V-BLIP, and an LLM), and aesthetics scoring via CLIP embeddings. This meticulous approach to data curation proved superior to training on uncurated data, as detailed in the research paper.
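As a rough illustration of the motion-annotation idea, the snippet below scores clips by their average dense optical-flow magnitude (OpenCV's Farneback method) and drops near-static ones. The threshold, sampling stride, and file names are arbitrary assumptions for the sketch; this is not the pipeline Stability AI used.

```python
import cv2
import numpy as np

def mean_flow_magnitude(video_path: str, sample_stride: int = 5) -> float:
    """Rough motion score for a clip: average dense optical-flow magnitude
    between sampled frame pairs. Static clips score near zero."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None
    magnitudes, idx = [], 0
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % sample_stride:
            continue  # only compare every Nth frame to keep this cheap
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=2).mean())
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Keep only clips with enough motion (file names and threshold are made up).
keep = [p for p in ["clip_0001.mp4", "clip_0002.mp4"]
        if mean_flow_magnitude(p) > 1.0]
```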
SVD comes in two main variants: the base SVD model, which generates 14 frames per clip, and SVD-XT, which extends this to 25 frames.
Both models support customizable frame rates between 3 and 30 fps and use a specialized, finetuned f8-decoder to ensure temporal consistency, maintaining visual coherence across frames.
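Building on the basic example above, the sketch below passes the frame-rate and motion micro-conditioning parameters exposed by the diffusers pipeline (fps, motion_bucket_id, noise_aug_strength) and decodes frames in chunks through the temporal decoder via decode_chunk_size. The specific values are illustrative assumptions, not recommended settings.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")  # assumes a CUDA GPU with enough VRAM

image = load_image("input.png").resize((1024, 576))

frames = pipe(
    image,
    fps=12,                   # target frame rate the model is conditioned on
    motion_bucket_id=160,     # higher values request more motion (default 127)
    noise_aug_strength=0.05,  # more noise on the conditioning image -> more motion
    decode_chunk_size=4,      # frames decoded together by the temporal decoder
    generator=torch.manual_seed(0),
).frames[0]

export_to_video(frames, "output_12fps.mp4", fps=12)
```

Lower decode_chunk_size values trade speed for reduced VRAM during decoding, while fps and motion_bucket_id steer how dynamic the resulting clip looks.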
Key capabilities include image-to-video generation from a single conditioning frame, smooth and coherent motion synthesis, built-in frame interpolation, adjustable frame rates, and a choice of 14- or 25-frame clip lengths.
The model performs strongly in human evaluation studies, with raters preferring its outputs over those of competing solutions such as Runway's Gen-2 and Pika Labs.
The model's development required substantial computational resources, with training consuming approximately 200,000 A100 80GB GPU-hours and emitting an estimated 19,000 kg of CO2 equivalent. Inference times on an A100 80GB card are:
Notable limitations include short clip lengths (around 4 seconds), outputs that sometimes contain little or no motion or only slow camera pans, no text-based control over the generated video, an inability to render legible text, and imperfect generation of faces and people.
The model is primarily intended for research purposes, including research on generative models, safe deployment of models with the potential to generate harmful content, probing the limitations and biases of generative models, and the generation of artworks for design and other artistic processes.
Commercial use requires a separate license from Stability AI. Using the model to generate factual or true representations of people or events is out of scope, as is any activity that violates Stability AI's Acceptable Use Policy.