The simplest way to self-host Stable Diffusion XL. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
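As one concrete route for local inference, here is a minimal sketch using Hugging Face's diffusers library, one of several compatible frameworks (the repository id, fp16 settings, and prompt below are illustrative assumptions, not requirements of this listing):

```python
# Minimal local-inference sketch with the diffusers library.
# Assumes a CUDA GPU with enough VRAM for fp16 SDXL (roughly 8 GB or more).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # base model weights
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.to("cuda")

image = pipe(prompt="a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```

On systems with limited VRAM, replacing `pipe.to("cuda")` with `pipe.enable_model_cpu_offload()` trades generation speed for a smaller memory footprint.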
Stable Diffusion XL pairs a 3.5B parameter base model with a refinement model, for a 6.6B parameter ensemble pipeline, featuring dual text encoders and a larger UNet backbone. Its distinctive conditioning scheme for image size and cropping, combined with training across multiple resolutions, enables better handling of varied aspect ratios and complex scene elements.
Stable Diffusion XL (SDXL) represents a significant advancement in open-source text-to-image generation models, building upon previous iterations of Stable Diffusion with substantial architectural improvements and enhanced capabilities. Developed by Stability AI, SDXL demonstrates superior performance across various image generation tasks while maintaining accessibility for researchers and developers.
SDXL employs a two-stage latent diffusion architecture consisting of a base model and a refinement model. The base model, which can function independently, contains 3.5 billion parameters; adding the refinement model brings the full ensemble pipeline to 6.6 billion parameters. This represents a significant scaling up from previous versions, with the UNet backbone alone containing 2.6 billion parameters - roughly three times larger than its predecessors, as detailed in the technical paper.
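As a sketch of how the two stages compose at inference time (using the diffusers "ensemble of expert denoisers" workflow; the 0.8 split point is a commonly suggested default, not a fixed property of the model), the base model denoises the high-noise portion of the schedule and hands its latents to the refiner:

```python
# Two-stage sketch: the base model handles the first 80% of the denoising
# schedule in latent space; the refiner finishes the rest and decodes pixels.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "a majestic lion jumping from a big stone at night"

# The base returns latents rather than a decoded image.
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images

# The refiner resumes denoising at the same point and decodes the result.
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
```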
A key architectural innovation is the use of two fixed, pretrained text encoders: OpenCLIP-ViT/G and CLIP-ViT/L. This dual-encoder approach enables a larger cross-attention context, contributing to improved image generation quality. The model also introduces novel conditioning schemes, including image size and cropping parameters, which help address artifacts present in earlier versions.
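These size and crop conditionings are exposed as inference-time inputs. As a sketch (parameter names as in the diffusers SDXL pipeline, reusing the `pipe` object from the first example), asking the model to behave as if it saw an uncropped 1024x1024 training image looks like this:

```python
# Micro-conditioning sketch: SDXL is conditioned on the original image size
# and crop coordinates of its training data, so setting these explicitly
# steers generation away from cropping and low-resolution artifacts.
image = pipe(
    prompt="a close-up portrait of a red fox in the snow",
    original_size=(1024, 1024),    # treat the source as a full 1024x1024 image
    crops_coords_top_left=(0, 0),  # treat it as uncropped
    target_size=(1024, 1024),      # desired output framing
).images[0]
```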
The training process for SDXL involved multiple stages: pretraining at 256x256 resolution, followed by training at 512x512, and culminating in multi-aspect-ratio finetuning at resolutions of roughly 1024x1024 pixels in total area. The model's autoencoder was also retrained from scratch, resulting in improved reconstruction performance compared to previous versions.
SDXL excels at generating high-quality images across diverse art styles, with particular strength in photorealism. It handles complex concepts more effectively than its predecessors and requires simpler prompts to achieve high-quality results. The model demonstrates improved understanding of nuanced concepts and generates images at a native resolution of 1024x1024.
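Thanks to the multi-aspect-ratio finetuning, non-square outputs whose total area stays near 1024x1024 also tend to work well. A brief sketch (the 1344x768 pairing is one commonly used ~16:9 bucket, chosen here for illustration):

```python
# Non-square generation sketch: choose a width/height pair whose area is
# close to 1024*1024, matching the multi-aspect finetuning regime.
image = pipe(
    prompt="a wide cinematic shot of a coastal village at dawn",
    width=1344,   # 1344 * 768 is approximately 1024 * 1024
    height=768,
).images[0]
```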
In user studies, SDXL is significantly preferred over previous Stable Diffusion versions (1.4/1.5 and 2.0/2.1). While traditional metrics such as FID and CLIP scores don't fully capture these improvements, human evaluation shows clear advantages over earlier models.
However, the model does have some limitations. These include: