Note: Stable Video Diffusion XT weights are released under the Stability AI Non-Commercial Research Community License and cannot be used for commercial purposes. Please read the license to verify whether your use case is permitted.
The simplest way to self-host Stable Video Diffusion XT. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
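As a sketch of what local inference can look like with Hugging Face's diffusers library (assuming diffusers >= 0.24, which ships StableVideoDiffusionPipeline, a CUDA GPU with roughly 12+ GB of VRAM, and network access to fetch the official stabilityai/stable-video-diffusion-img2vid-xt checkpoint; the function name is illustrative):

```python
# Sketch: single-image-to-video inference with Hugging Face diffusers.
# Assumes diffusers >= 0.24, torch with CUDA, and enough VRAM (~12 GB
# with fp16 weights and chunked decoding).

MODEL_ID = "stabilityai/stable-video-diffusion-img2vid-xt"

def image_to_video(image_path: str, out_path: str = "generated.mp4",
                   seed: int = 42, fps: int = 7) -> str:
    """Turn a single still image into a short video clip."""
    # Heavy imports are deferred so this module can load even
    # where torch/diffusers are not installed.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, variant="fp16"
    )
    pipe.to("cuda")

    # SVD-XT expects a 1024x576 (width x height) conditioning image.
    image = load_image(image_path).resize((1024, 576))

    generator = torch.manual_seed(seed)
    # decode_chunk_size trades VRAM for speed when decoding the 25 latent frames.
    frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

    export_to_video(frames, out_path, fps=fps)
    return out_path
```

At the default 7 fps, the 25 generated frames play back as roughly 3.6 seconds of video; lowering decode_chunk_size reduces peak VRAM at the cost of speed.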
Stable Video Diffusion XT transforms single images into 4-second videos at 576x1024 resolution. Built on Stable Diffusion 2.1, it was trained on 152M video clips and generates 25 frames with improved temporal consistency. Key advances include multi-view synthesis capabilities and enhanced frame-to-frame coherence.
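The headline numbers above imply a few quantities worth spelling out. Assuming the 8x spatial downsampling and 4 latent channels of the Stable Diffusion VAE (inherited from the image models, not restated on this page), the working latent for one clip and its approximate duration work out as follows:

```python
# Back-of-the-envelope numbers for one SVD-XT clip.
# Assumption: the VAE downsamples by 8x per side with 4 latent channels,
# as in Stable Diffusion 2.1.

frames, height, width = 25, 576, 1024
latent_channels, downsample = 4, 8

latent_shape = (frames, latent_channels, height // downsample, width // downsample)
pixels_per_clip = frames * height * width

# 25 frames played back at ~6-7 fps gives the quoted ~4-second clips.
duration_at_6fps = frames / 6
duration_at_7fps = frames / 7

print(latent_shape)     # (25, 4, 72, 128)
print(pixels_per_clip)  # 14745600
print(round(duration_at_6fps, 2), round(duration_at_7fps, 2))  # 4.17 3.57
```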
Stable Video Diffusion XT represents a significant advancement in AI-powered video generation, building on the foundation of Stability AI's Stable Diffusion image models. Released in November 2023 as a research preview, this latent video diffusion model specializes in converting single images into video clips up to 4 seconds long.
The model's architecture extends the capabilities of Stable Diffusion 2.1 by incorporating temporal convolution and attention layers. The training process followed three critical stages:

1. Text-to-image pretraining, starting from the Stable Diffusion 2.1 image model
2. Video pretraining on a large, curated dataset of low-resolution clips
3. High-quality video finetuning on a smaller set of high-resolution videos
The training process was computationally intensive, utilizing approximately 200,000 A100 80GB hours and resulting in an estimated carbon footprint of ~19,000kg CO2 eq. and energy consumption of ~64,000 kWh.
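Those compute figures can be cross-checked with simple arithmetic (the per-GPU power draw and grid carbon intensity below are derived values, not numbers reported by Stability AI):

```python
# Cross-checking the reported training footprint.
gpu_hours = 200_000   # A100 80GB hours
energy_kwh = 64_000   # total energy consumption
co2_kg = 19_000       # reported emissions

# Implied average draw per GPU: 64,000 kWh / 200,000 h = 0.32 kW (320 W),
# a plausible sustained load for an A100 80GB (300-400 W board power).
avg_kw_per_gpu = energy_kwh / gpu_hours

# Implied grid carbon intensity: ~0.30 kg CO2 per kWh.
kg_co2_per_kwh = co2_kg / energy_kwh

print(round(avg_kw_per_gpu, 2), round(kg_co2_per_kwh, 3))  # 0.32 0.297
```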
Data curation played a crucial role in the model's development. The team implemented a systematic pipeline that included:

- Cut detection to split source videos into single-shot clips
- Synthetic captioning of each clip
- Optical-flow-based motion scoring to filter out near-static clips
- OCR-based filtering to remove clips dominated by overlaid text
- CLIP-based aesthetic and text-image alignment scoring
SVD-XT generates 25 frames at 576x1024 resolution, an improvement over the original SVD model's 14 frames. The model incorporates:

- Temporal attention and convolution layers interleaved with the spatial layers inherited from Stable Diffusion 2.1
- Image conditioning via CLIP embeddings and a noise-augmented latent of the input frame concatenated to the UNet input
- Micro-conditioning on frame rate and a motion score, giving some control over how much the output moves
Human preference studies have demonstrated that SVD-Image-to-Video XT outperforms competing models like RunwayML GEN-2 and PikaLabs in terms of video quality. The model exhibits strong multi-view 3D-prior capabilities, making it suitable for generating multiple views of objects in a feedforward manner.
The primary applications for SVD-XT include:

- Research on generative video models, including probing their limitations and biases
- Generation of artworks and use in design and other creative processes
- Educational and creative tools
However, the model does have several notable limitations:

- Generated clips are short (4 seconds or less)
- Outputs may contain little motion, or only slow camera pans
- The model cannot be controlled through text prompts
- Legible text cannot be rendered
- Faces and people in general may not be generated properly
- The lossy autoencoder introduces some visual degradation
The model is available under the "stable-video-diffusion-community" license, with commercial use requiring adherence to Stability AI's commercial license. The complete codebase is accessible through the official GitHub repository.