The simplest way to self-host Stable Diffusion 2. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Stable Diffusion 2 is a text-to-image model built on the Latent Diffusion architecture with an OpenCLIP-ViT/H text encoder. Trained on LAION-5B, with 290k additional steps beyond the base checkpoint, it generates images at 512x512 or 768x768 pixels and offers depth-aware generation, inpainting, and upscaling up to 2048x2048 resolution.
Stable Diffusion 2 represents a significant advancement in text-to-image AI technology, released by Stability AI on November 24, 2022. Building upon its predecessor's success, this open-source model introduces substantial improvements in image quality and capabilities while maintaining accessibility for researchers and developers.
At its core, Stable Diffusion 2 is a Latent Diffusion Model that uses a pretrained OpenCLIP-ViT/H text encoder and operates primarily on English prompts. The 768-v checkpoint builds upon the Stable Diffusion 2-base model, undergoing additional training for 150,000 steps with a v-objective, followed by 140,000 steps on 768x768 images.
The model's architecture represents a significant departure from its predecessor, particularly in its text encoder. While Stable Diffusion v1 used a frozen CLIP ViT-L/14 text encoder, version 2 implements the more advanced OpenCLIP-ViT/H, developed by LAION with Stability AI's support, resulting in notably improved image fidelity.
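The snippet below is a minimal sketch of loading the OpenCLIP-ViT/H encoder on its own with the open_clip library; the pretrained tag naming the LAION-2B checkpoint is an assumption about which public weights correspond to the encoder used here, and Stable Diffusion 2 itself consumes the encoder's hidden states rather than the pooled features shown.

```python
# Sketch: loading an OpenCLIP-ViT/H text encoder with the open_clip library.
# The pretrained tag "laion2b_s32b_b79k" is an assumed public checkpoint name.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")

tokens = tokenizer(["a photograph of an astronaut riding a horse"])
with torch.no_grad():
    text_features = model.encode_text(tokens)

print(text_features.shape)  # torch.Size([1, 1024]); ViT-H/14 text width is 1024
```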
The training process utilized the LAION-5B dataset and its subsets, carefully filtered for explicit content using LAION's NSFW detector with conservative thresholds. The training procedure involved encoding images into latent representations using an autoencoder while processing text prompts through OpenCLIP-ViT/H. Training was conducted on 32 x 8 x A100 GPUs using the AdamW optimizer.
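To make the training procedure concrete, the following is a schematic sketch of a single latent-diffusion training step with a v-prediction objective; the unet, vae, and text_encoder stubs, the latent scaling constant, and the noise schedule handling are illustrative assumptions rather than the actual Stability AI training code.

```python
# Schematic single training step for a latent diffusion model with the
# v-objective. All model stubs and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def training_step(unet, vae, text_encoder, images, token_ids, alphas_cumprod):
    # Encode images into the autoencoder's latent space (SD-style scaling).
    latents = vae.encode(images).latent_dist.sample() * 0.18215

    # Sample a random timestep and Gaussian noise per example.
    t = torch.randint(0, alphas_cumprod.shape[0], (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)

    # Signal / noise scales from the cumulative alpha schedule.
    alpha_t = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sigma_t = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)

    # Forward diffusion: mix clean latents with noise.
    noisy_latents = alpha_t * latents + sigma_t * noise

    # v-objective target: v = alpha_t * noise - sigma_t * x0.
    target = alpha_t * noise - sigma_t * latents

    # Condition the U-Net on the text embeddings and predict v.
    text_embeddings = text_encoder(token_ids)
    v_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

    return F.mse_loss(v_pred, target)
```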
Stable Diffusion 2 comes in several variants:
512-base-ema.ckpt: Trained on a filtered LAION-5B subset
768-v-ema.ckpt: Extended training from the base model
512-depth-ema.ckpt: Incorporates MiDaS depth prediction
512-inpainting-ema.ckpt: Specialized for inpainting tasks
x4-upscaling-ema.ckpt: Text-guided latent upscaling model
The model generates images at default resolutions of 512x512 and 768x768 pixels, with the potential for higher resolutions (up to 2048x2048 or beyond) when combined with the included super-resolution upscaler. A notable addition is the depth-guided model, depth2img, which uses MiDaS depth estimation to generate new images while maintaining structural coherence.
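As an illustration of the depth-guided checkpoint, the sketch below uses the StableDiffusionDepth2ImgPipeline from the diffusers library; the model ID, input image URL, prompt, and strength value are example values, not prescribed settings.

```python
# Illustrative depth-guided image-to-image generation with diffusers.
# Model ID, image URL, prompt, and strength are example values.
import torch
import requests
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)

# MiDaS depth is estimated from init_image inside the pipeline; `strength`
# controls how far the output may depart from the original layout.
image = pipe(
    prompt="two tigers lying on a couch, photorealistic",
    image=init_image,
    negative_prompt="blurry, deformed",
    strength=0.7,
).images[0]
image.save("depth2img.png")
```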
For optimal performance, implementation can be achieved through the diffusers library, which requires additional dependencies including transformers, accelerate, scipy, and safetensors. The xformers library is recommended for memory-efficient attention, and pipe.enable_attention_slicing() can be used to reduce VRAM usage at the cost of speed.
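A minimal text-to-image sketch with the diffusers library is shown below; the model ID, scheduler choice, prompt, and output resolution are example values, and the xformers call is left commented out since it needs the optional package.

```python
# Minimal text-to-image inference sketch with diffusers.
# Model ID, scheduler, prompt, and resolution are example values.
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

model_id = "stabilityai/stable-diffusion-2"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Trade speed for lower VRAM usage, as described above.
pipe.enable_attention_slicing()
# Requires the optional xformers package; uncomment if it is installed.
# pipe.enable_xformers_memory_efficient_attention()

image = pipe(
    "a professional photograph of an astronaut riding a horse",
    height=768,
    width=768,
).images[0]
image.save("astronaut.png")
```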
The model's rapid adoption and impact on the AI community are evident in its GitHub activity.
The model card acknowledges several limitations, including imperfect photorealism, difficulty rendering legible text, compositional challenges with complex prompts, and degraded performance on non-English prompts.
The environmental impact is estimated at 15,000 kg CO2 eq., calculated using the Machine Learning Impact calculator based on the hardware and training time used for Stable Diffusion v1.