Note: Stable Video Diffusion XT 1.1 weights are released under the Stability AI Non-Commercial Research Community License and cannot be used for commercial purposes. Please read the license to confirm whether your use case is permitted.
The simplest way to self-host Stable Video Diffusion XT 1.1. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Stable Video Diffusion XT 1.1 converts still images into 4-second video clips at 1024x576 resolution. Fine-tuned for 6 FPS output, it uses temporal convolution and attention layers to maintain consistency across 25 frames. Built on Stable Diffusion 2.1 and trained on 152M video clips plus 1M curated samples.
Stable Video Diffusion XT 1.1 represents a significant advancement in AI-powered video generation, building upon the architecture of Stable Diffusion 2.1. The model utilizes latent diffusion technology, incorporating temporal convolution and attention layers specifically designed for video processing. This architecture enables the model to generate short video clips of up to 4 seconds from a single input image, producing output at 1024x576 resolution with 25 frames.
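The output dimensions above can be sketched numerically. This is a rough illustration only: the 8x spatial downscale and 4 latent channels are assumptions carried over from the Stable Diffusion-family VAE, not figures stated on this page.

```python
# Sketch of SVD XT 1.1 output and latent tensor shapes.
# Assumes the Stable Diffusion-family VAE (8x spatial downscale,
# 4 latent channels) -- an assumption, not stated in this page.
frames = 25          # frames per generated clip
width, height = 1024, 576
fps = 6              # the fine-tuned output frame rate

# Pixel-space output: 25 RGB frames at 1024x576.
output_shape = (frames, 3, height, width)

# Latent-space tensor the temporal layers would operate on.
latent_shape = (frames, 4, height // 8, width // 8)

duration_s = frames / fps  # clip length in seconds

print(output_shape)          # (25, 3, 576, 1024)
print(latent_shape)          # (25, 4, 72, 128)
print(round(duration_s, 2))  # 4.17
```

Note that 25 frames at 6 FPS works out to roughly 4.2 seconds, consistent with the "up to 4 seconds" figure quoted above.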
The development of Stable Video Diffusion models follows a sophisticated three-stage training process, as detailed in the research paper. The process begins with text-to-image pretraining using Stable Diffusion 2.1, followed by video pretraining on a massive dataset of approximately 152 million clips (LVD-F), and concludes with high-resolution video finetuning on a carefully curated dataset of about 1 million samples.
A distinguishing feature of the training methodology is its systematic data curation process. As described in the research paper, this includes cascaded cut detection to split raw footage into clips, synthetic captioning, optical-flow-based filtering of near-static clips, OCR-based removal of text-heavy clips, and CLIP-based aesthetic and caption-alignment scoring.
The XT 1.1 variant is a fine-tuned version of the SVD Image-to-Video model, optimized for consistent output at 6 FPS with motion bucket ID 127. While these parameters were fixed during fine-tuning, they can still be adjusted at inference time, though performance may vary compared to SVD 1.0 under different settings.
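In practice, these fine-tuned defaults surface as generation parameters. Below is a minimal sketch of serving the model with Hugging Face diffusers' `StableVideoDiffusionPipeline`; the repository id and exact call signature are assumptions based on the diffusers API rather than details from this page, so verify them against the diffusers documentation before use.

```python
# Hypothetical sketch: generating a clip with SVD XT 1.1 via diffusers.
# Repo id and call signature are assumptions; verify before use.

# Defaults the XT 1.1 fine-tune was optimized for (per the text above).
SVD_XT_11_DEFAULTS = {"fps": 6, "motion_bucket_id": 127, "num_frames": 25}

def generate_clip(image_path: str, out_path: str = "clip.mp4") -> str:
    """Turn a single still image into a ~4 s, 1024x576 video clip."""
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt-1-1",  # assumed repo id
        torch_dtype=torch.float16,
    ).to("cuda")

    image = load_image(image_path).resize((1024, 576))
    frames = pipe(
        image,
        fps=SVD_XT_11_DEFAULTS["fps"],                # fixed at 6 during fine-tuning
        motion_bucket_id=SVD_XT_11_DEFAULTS["motion_bucket_id"],  # fixed at 127
        num_frames=SVD_XT_11_DEFAULTS["num_frames"],
        decode_chunk_size=8,  # decode a few frames at a time to limit VRAM use
    ).frames[0]
    export_to_video(frames, out_path, fps=SVD_XT_11_DEFAULTS["fps"])
    return out_path
```

Raising `motion_bucket_id` above 127 requests more motion and lowering it requests less, but as noted above, moving away from the fine-tuned defaults may degrade consistency relative to SVD 1.0.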
The model demonstrates several key capabilities:
However, users should be aware of several limitations:
The model is released under the "stable-video-diffusion-1-1-community" license, which permits both research and commercial use; commercial users must register with Stability AI once annual revenue exceeds $1,000,000. Research applications can include:
Users must also adhere to Stability AI's Acceptable Use Policy. Note that the model was not trained to produce factual or true representations of real people or events, and generating such content without proper authorization is out of scope.