Model Report
stabilityai / Stable Diffusion 1.1
Stable Diffusion 1.1 is a latent text-to-image diffusion model developed by CompVis, Stability AI, and Runway that generates images from natural language prompts. The model uses a VAE to compress images into latent space, a U-Net for denoising, and a CLIP text encoder for conditioning. Trained on LAION dataset subsets at 512×512 resolution, it supports text-to-image generation, image-to-image translation, and inpainting applications while operating efficiently in compressed latent space.
Stable Diffusion 1.1 is a latent text-to-image diffusion model developed by CompVis in collaboration with Stability AI and Runway. It is part of the v1 family of Stable Diffusion models, designed to synthesize detailed images from natural language prompts while keeping computational requirements modest. The model builds on advances in latent diffusion architectures, enabling both detailed image generation and versatile image manipulation, which supports research and creative applications (Stable Diffusion v1-1 model card, arXiv: Latent Diffusion Models).
A collage of sample images generated by Stable Diffusion 1.1, illustrating its creative diversity in response to varied text prompts.
Architecture

Stable Diffusion 1.1 is based on the Latent Diffusion Model (LDM) framework (original paper). In this approach, image synthesis and transformation occur in a lower-dimensional latent space, significantly reducing computational demands during both training and inference.
The core architecture comprises three main components:
Variational Autoencoder (VAE): The VAE encodes images into a compressed latent representation. For Stable Diffusion, a 512×512 input image is mapped to a 64×64 latent (a spatial downsampling factor of 8 per side, i.e. 64× fewer spatial positions). During inference, the decoder reconstructs the output image from the denoised latent variables (model documentation).
U-Net Backbone: The U-Net serves as the backbone for the denoising process, utilizing an encoder-decoder configuration with ResNet blocks and shortcut connections. It predicts the noise in the current latent step and, conditioned on text embeddings, incrementally refines the image. The U-Net in this model contains approximately 860 million parameters.
Text Encoder: Text prompts are processed by a frozen CLIP ViT-L/14 encoder (CLIP paper). The resulting embeddings guide the U-Net through cross-attention layers to ensure text-conditional image generation.
The efficiency of Stable Diffusion arises from operating in this compressed latent domain, allowing high-resolution generation on standard GPUs (arXiv: Latent Diffusion Models).
Inference pipeline for Stable Diffusion 1.1: text prompts are embedded, transformed in latent space by the U-Net, and decoded into images by the VAE.
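These components can be inspected directly. The following is a minimal sketch using the diffusers library and the CompVis/stable-diffusion-v1-1 checkpoint on Hugging Face; the class names reflect the diffusers implementation rather than the original CompVis training code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the v1.1 checkpoint and inspect the three components described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

print(type(pipe.vae).__name__)           # AutoencoderKL (image <-> latent codec)
print(type(pipe.unet).__name__)          # UNet2DConditionModel (denoising backbone)
print(type(pipe.text_encoder).__name__)  # CLIPTextModel (frozen CLIP ViT-L/14)

# A 512x512 RGB image corresponds to a 4x64x64 latent (downsampling factor of 8 per side).
unet_params = sum(p.numel() for p in pipe.unet.parameters())
print(f"U-Net parameters: {unet_params / 1e6:.0f}M")  # roughly 860M
```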
Training Data and Process

Stable Diffusion 1.1 was trained on large-scale, openly available datasets to support broad generalization capabilities. The primary training data consists of subsets of the LAION-5B database, a comprehensive resource of image–text pairs (LAION-5B blog).
The training strategy for v1.1 involved an initial pretraining phase on the laion2B-en dataset at 256×256 pixel resolution for 237,000 steps, followed by training at 512×512 resolution on the laion-high-resolution subset, which comprises over 170 million image–text pairs (training details). This two-stage process enabled the model to attain both broad visual knowledge and improved high-fidelity synthesis.
Optimization was performed with the AdamW optimizer and large batch sizes on significant computational infrastructure (32×8 A100 GPUs, batch size 2048). The primary loss was a reconstruction objective in latent space, comparing the noise added to the latents with the U-Net's prediction. According to the published training report, total training consumed approximately 150,000 GPU hours (model card).
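The training objective can be illustrated schematically. The sketch below uses diffusers components to show the latent-space noise-prediction loss described above; the batch, captions, and scheduler choice are illustrative stand-ins, not the original training configuration.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-1"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

images = torch.randn(2, 3, 512, 512)  # stand-in for a training batch scaled to [-1, 1]
captions = ["a photo of a cat", "a watercolor landscape"]

# 1. Encode images into the latent space (scaled as in the v1 pipelines).
latents = vae.encode(images).latent_dist.sample() * 0.18215

# 2. Add noise at a randomly sampled timestep.
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# 3. Obtain text embeddings from the frozen CLIP encoder.
tokens = tokenizer(captions, padding="max_length", max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
encoder_hidden_states = text_encoder(tokens.input_ids)[0]

# 4. Predict the noise and take an MSE loss in latent space.
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample
loss = F.mse_loss(noise_pred, noise)
```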
Functionality and Use Cases
Stable Diffusion 1.1 supports diverse modes of image generation and manipulation, driven by its latent diffusion mechanism:
Text-to-Image Generation: Given a natural language prompt, the model synthesizes new images from scratch, capturing a wide range of concepts, styles, and details (Hugging Face introduction).
Generated samples from varied prompts, exemplifying Stable Diffusion 1.1's capacity for generating diverse visual content.
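As a concrete illustration, the following minimal diffusers sketch generates an image from a prompt with the commonly used settings of 50 sampling steps and a guidance scale of 7.5; the prompt and output filename are arbitrary.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

# 512x512 text-to-image generation from a natural language prompt.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("astronaut.png")
```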
Image-to-Image Translation: The model can receive an image and a guiding prompt, then transform the image based on textual instructions while maintaining structural coherence (model documentation).
A vivid fantasy landscape, demonstrating the model's image-to-image translation abilities.
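A sketch of this image-to-image workflow with diffusers' StableDiffusionImg2ImgPipeline is shown below; the input file sketch.png and the strength value are illustrative placeholders.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png").resize((512, 512))  # hypothetical input image

# The prompt steers the transformation; lower strength preserves more of the input structure.
image = pipe(
    prompt="a fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
image.save("fantasy_landscape.png")
```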
Inpainting and Outpainting: Users can selectively modify regions of an image (inpainting) or extend images beyond their boundaries (outpainting) using targeted text prompts (model notes); a brief inpainting sketch follows this list of use cases.
Artistic, Educational, and Research Applications: Use cases span digital art, design ideation, educational materials, and investigations into generative model limitations or biases (Hugging Face model card).
Further model outputs illustrating creative scene synthesis from descriptive text prompts.
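The inpainting sketch referenced above might look like the following with diffusers' StableDiffusionInpaintPipeline, assuming an input image and a binary mask of the region to regenerate. Note that checkpoints fine-tuned specifically for inpainting are commonly used in practice, and support for plain v1 weights in this pipeline depends on the diffusers version; the file names and prompt here are illustrative.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("room.png").resize((512, 512))   # hypothetical source image
mask_image = load_image("mask.png").resize((512, 512))   # white pixels = region to repaint

image = pipe(
    prompt="a red armchair in the corner of the room",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("inpainted.png")
```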
Performance Evaluation

Performance assessment of Stable Diffusion 1.1 employs quantitative metrics such as CLIP score (measuring semantic alignment between text and generated image) and Fréchet Inception Distance (FID, evaluating distributional similarity to real images). Evaluation was conducted on large sets of prompts sampled from the COCO2017 validation set at 512×512 resolution.
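For reference, a CLIP score of the kind described here can be approximated with the openly available CLIP ViT-L/14 weights. The sketch below computes the scaled cosine similarity between image and text embeddings, a common formulation rather than the exact evaluation harness behind the published numbers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Same encoder family (CLIP ViT-L/14) as the one used to condition the model.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Scaled cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return max(100.0 * (image_emb * text_emb).sum().item(), 0.0)

print(clip_score(Image.open("astronaut.png"), "a photograph of an astronaut riding a horse"))
```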
Checkpoint evaluations show that with increased classifier-free guidance (cfg-scale), images more closely follow prompts but may lose diversity. The default configuration for sd-v1-1.ckpt employs a guidance scale around 7.5 and 50 sampling steps, where the balance of semantic fidelity and visual realism is optimized (evaluation protocol).
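The trade-off controlled by the guidance scale comes from classifier-free guidance, in which the U-Net's unconditional and text-conditional noise predictions are blended at every sampling step. A minimal, self-contained sketch of that blending step is shown below, with toy tensors standing in for the two U-Net predictions.

```python
import torch

def classifier_free_guidance(noise_uncond: torch.Tensor,
                             noise_text: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    """Blend unconditional and text-conditional noise predictions.

    Larger guidance_scale values push samples toward the prompt
    (higher semantic fidelity) at the cost of diversity.
    """
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Toy tensors with the shape of a 4x64x64 latent at one denoising step.
eps_uncond = torch.randn(1, 4, 64, 64)
eps_text = torch.randn(1, 4, 64, 64)
eps_guided = classifier_free_guidance(eps_uncond, eps_text, guidance_scale=7.5)
```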
Limitations
Despite its effectiveness, Stable Diffusion 1.1 demonstrates several noteworthy limitations:
Bias and Representation: The model reflects biases present in the web-scraped LAION-5B dataset, particularly toward Western and English-centric content. This can impact fairness and diversity in generated outputs (model card discussion).
Photorealism and Composition: While capable of generating convincing images, the model can underperform on tasks requiring perfect photorealism or complex compositional arrangements (e.g., spatial reasoning described in text). Rendering textual content within images or fine-grained details, especially for faces, remains challenging.
Resolution Constraints: Image quality may diminish below 512×512 resolution or when simultaneously increasing both width and height beyond this baseline, leading to artifacts and loss of global image coherence (technical documentation).
Memorization and Repetition: No deduplication methods were applied in dataset curation, leading to potential memorization of common or repeated images (Hugging Face model card).
Language Dependency: The model is most effective for prompts in English; generation quality degrades in other languages.
Fine-tuning Needs: Customizing the model for niche domains or styles typically requires computationally expensive fine-tuning.
Watermarking Robustness: Outputs include invisible watermarking for machine-generated image identification, but rescaling or rotating images can reduce its efficacy (invisible-watermark repository).
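The watermark is applied with the invisible-watermark package. A sketch of the encoding step, following the pattern used by the v1 reference scripts (the payload string and file names here are illustrative), might look like this:

```python
import cv2
import numpy as np
from PIL import Image
from imwatermark import WatermarkEncoder

# Embed an invisible watermark using the dwtDct method.
wm_text = "StableDiffusionV1"
encoder = WatermarkEncoder()
encoder.set_watermark("bytes", wm_text.encode("utf-8"))

pil_img = Image.open("generated.png")                      # hypothetical generated image
bgr = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)   # the library operates on BGR arrays
bgr_marked = encoder.encode(bgr, "dwtDct")
Image.fromarray(cv2.cvtColor(bgr_marked, cv2.COLOR_BGR2RGB)).save("generated_wm.png")
```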
Licensing
Stable Diffusion 1.1 is distributed under the CreativeML OpenRAIL M license, which governs responsible use and redistribution (license terms). The license permits both research and commercial applications, with explicit restrictions to prevent generation or dissemination of illegal, harmful, or offensive content, as well as guidance for responsible model sharing and output use.
Release Timeline and Model Family
Stable Diffusion v1.1 was released in August 2022. It is part of a broader family of v1 checkpoints, including v1.2, v1.3, and v1.4, each representing further training and refinement on curated datasets with adjusted aesthetic filters and sampling strategies (CompVis at Hugging Face). Subsequent versions, such as Stable Diffusion 1.5, build progressively on the v1.1 foundation by extending training duration or integrating data filtered for higher aesthetic quality, offering alternative trade-offs in output characteristics (dataset details).
Example Outputs
Outputs from Stable Diffusion 1.1 are reproducible for a fixed random seed and sampling configuration, and vary when the seed or sampling parameters change.
"A photograph of an astronaut riding a horse"—sample output generated under default settings.