Model Report
stabilityai / Stable Diffusion 1.1
Stable Diffusion 1.1 is a latent text-to-image diffusion model developed by CompVis, Stability AI, and Runway that generates images from natural language prompts. The model uses a VAE to compress images into latent space, a U-Net for denoising, and a CLIP text encoder for conditioning. Trained on LAION dataset subsets at 512×512 resolution, it supports text-to-image generation, image-to-image translation, and inpainting applications while operating efficiently in compressed latent space.
Stable Diffusion 1.1 is a latent text-to-image diffusion model developed by CompVis in collaboration with Stability AI and Runway. It is part of the v1 family of Stable Diffusion models, designed to synthesize detailed images from natural language prompts while keeping computational requirements modest. The model builds on advances in latent diffusion architectures, enabling both detailed image generation and versatile image manipulation, which supports research and creative applications (Stable Diffusion v1-1 model card, arXiv: Latent Diffusion Models).
A collage of sample images generated by Stable Diffusion 1.1, illustrating its creative diversity in response to varied text prompts.
Architecture

Stable Diffusion 1.1 is based on the Latent Diffusion Model (LDM) framework (original paper). In this approach, image synthesis and transformation occur in a lower-dimensional latent space, significantly reducing computational demands during both training and inference.
The core architecture comprises three main components:
Variational Autoencoder (VAE): The VAE encodes images into a compressed latent representation. For Stable Diffusion, a 512×512 input image is mapped to a 64×64 latent (a spatial downsampling factor of 8 per side, i.e. 64× fewer spatial positions). During inference, the decoder reconstructs the output image from the denoised latent variables (model documentation).
U-Net Backbone: The U-Net serves as the backbone for the denoising process, utilizing an encoder-decoder configuration with ResNet blocks and shortcut connections. It predicts the noise in the current latent step and, conditioned on text embeddings, incrementally refines the image. The U-Net in this model contains approximately 860 million parameters.
Text Encoder: Text prompts are processed by a frozen CLIP ViT-L/14 encoder (CLIP paper). The resulting embeddings guide the U-Net through cross-attention layers to ensure text-conditional image generation.
The efficiency of Stable Diffusion arises from operating in this compressed latent domain, allowing high-resolution generation on standard GPUs (arXiv: Latent Diffusion Models).
Inference pipeline for Stable Diffusion 1.1: text prompts are embedded, transformed in latent space by the U-Net, and decoded into images by the VAE.
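These components can be inspected directly. The following is a minimal sketch using the diffusers library and the CompVis/stable-diffusion-v1-1 checkpoint on Hugging Face; the class names reflect the diffusers implementation rather than the original CompVis training code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the v1.1 checkpoint and inspect the three components described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

print(type(pipe.vae).__name__)           # AutoencoderKL (image <-> latent codec)
print(type(pipe.unet).__name__)          # UNet2DConditionModel (denoising backbone)
print(type(pipe.text_encoder).__name__)  # CLIPTextModel (frozen CLIP ViT-L/14)

# A 512x512 RGB image corresponds to a 4x64x64 latent (downsampling factor of 8 per side).
unet_params = sum(p.numel() for p in pipe.unet.parameters())
print(f"U-Net parameters: {unet_params / 1e6:.0f}M")  # roughly 860M
```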
Training Data and Process

Stable Diffusion 1.1 was trained on large-scale, openly available datasets to support broad generalization capabilities. The primary training data consists of subsets of the LAION-5B database, a comprehensive resource of image–text pairs (LAION-5B blog).
The training strategy for v1.1 involved an initial pretraining phase on the laion2B-en dataset at 256×256 pixel resolution for 237,000 steps, followed by training at 512×512 resolution on the laion-high-resolution subset, which comprises over 170 million image–text pairs (training details). This two-stage process enabled the model to attain both broad visual knowledge and improved high-fidelity synthesis.
Optimization was performed with the AdamW optimizer and large batch sizes on significant computational infrastructure (32×8 A100 GPUs, batch size 2048). The primary loss was a reconstruction objective in latent space, comparing the noise added to the latents with the U-Net's prediction. According to the published training report, total training consumed approximately 150,000 GPU hours (model card).
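The training objective can be illustrated schematically. The sketch below uses diffusers components to show the latent-space noise-prediction loss described above; the batch, captions, and scheduler choice are illustrative stand-ins, not the original training configuration.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-1"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

images = torch.randn(2, 3, 512, 512)  # stand-in for a training batch scaled to [-1, 1]
captions = ["a photo of a cat", "a watercolor landscape"]

# 1. Encode images into the latent space (scaled as in the v1 pipelines).
latents = vae.encode(images).latent_dist.sample() * 0.18215

# 2. Add noise at a randomly sampled timestep.
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# 3. Obtain text embeddings from the frozen CLIP encoder.
tokens = tokenizer(captions, padding="max_length", max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
encoder_hidden_states = text_encoder(tokens.input_ids)[0]

# 4. Predict the noise and take an MSE loss in latent space.
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample
loss = F.mse_loss(noise_pred, noise)
```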
Functionality and Use Cases
Stable Diffusion 1.1 supports diverse modes of image generation and manipulation, driven by its latent diffusion mechanism:
Text-to-Image Generation: Given a natural language prompt, the model synthesizes new images from scratch, capturing a wide range of concepts, styles, and details (Hugging Face introduction).
Generated samples from varied prompts, exemplifying Stable Diffusion 1.1's capacity for generating diverse visual content.
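As a concrete illustration, the following minimal diffusers sketch generates an image from a prompt with the commonly used settings of 50 sampling steps and a guidance scale of 7.5; the prompt and output filename are arbitrary.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

# 512x512 text-to-image generation from a natural language prompt.
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    guidance_scale=7.5,
    height=512,
    width=512,
).images[0]
image.save("astronaut.png")
```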
Image-to-Image Translation: The model can receive an image and a guiding prompt, then transform the image based on textual instructions while maintaining structural coherence (model documentation).
A vivid fantasy landscape, demonstrating the model's image-to-image translation abilities.
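A sketch of this image-to-image workflow with diffusers' StableDiffusionImg2ImgPipeline is shown below; the input file sketch.png and the strength value are illustrative placeholders.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png").resize((512, 512))  # hypothetical input image

# The prompt steers the transformation; lower strength preserves more of the input structure.
image = pipe(
    prompt="a fantasy landscape, trending on artstation",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
image.save("fantasy_landscape.png")
```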
Inpainting and Outpainting: Users can selectively modify regions of an image (inpainting) or extend images beyond their boundaries (outpainting) using targeted text prompts (model notes); a brief inpainting sketch follows this list of use cases.
Artistic, Educational, and Research Applications: Use cases span digital art, design ideation, educational materials, and investigations into generative model limitations or biases (Hugging Face model card).
Further model outputs illustrating creative scene synthesis from descriptive text prompts.
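The inpainting sketch referenced above might look like the following with diffusers' StableDiffusionInpaintPipeline, assuming an input image and a binary mask of the region to regenerate. Note that checkpoints fine-tuned specifically for inpainting are commonly used in practice, and support for plain v1 weights in this pipeline depends on the diffusers version; the file names and prompt here are illustrative.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-1", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("room.png").resize((512, 512))   # hypothetical source image
mask_image = load_image("mask.png").resize((512, 512))   # white pixels = region to repaint

image = pipe(
    prompt="a red armchair in the corner of the room",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("inpainted.png")
```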
Performance Evaluation

Performance assessment of Stable Diffusion 1.1 employs quantitative metrics such as CLIP score (measuring semantic alignment between text and generated image) and Fréchet Inception Distance (FID, evaluating distributional similarity to real images). Evaluation was conducted on large sets of prompts sampled from the COCO2017 validation set at 512×512 resolution.
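For reference, a CLIP score of the kind described here can be approximated with the openly available CLIP ViT-L/14 weights. The sketch below computes the scaled cosine similarity between image and text embeddings, a common formulation rather than the exact evaluation harness behind the published numbers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Same encoder family (CLIP ViT-L/14) as the one used to condition the model.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Scaled cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return max(100.0 * (image_emb * text_emb).sum().item(), 0.0)

print(clip_score(Image.open("astronaut.png"), "a photograph of an astronaut riding a horse"))
```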
Checkpoint evaluations show that with increased classifier-free guidance (cfg-scale), images more closely follow prompts but may lose diversity. The default configuration for sd-v1-1.ckpt employs a guidance scale around 7.5 and 50 sampling steps, where the balance of semantic fidelity and visual realism is optimized (evaluation protocol).
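The trade-off controlled by the guidance scale comes from classifier-free guidance, in which the U-Net's unconditional and text-conditional noise predictions are blended at every sampling step. A minimal, self-contained sketch of that blending step is shown below, with toy tensors standing in for the two U-Net predictions.

```python
import torch

def classifier_free_guidance(noise_uncond: torch.Tensor,
                             noise_text: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    """Blend unconditional and text-conditional noise predictions.

    Larger guidance_scale values push samples toward the prompt
    (higher semantic fidelity) at the cost of diversity.
    """
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Toy tensors with the shape of a 4x64x64 latent at one denoising step.
eps_uncond = torch.randn(1, 4, 64, 64)
eps_text = torch.randn(1, 4, 64, 64)
eps_guided = classifier_free_guidance(eps_uncond, eps_text, guidance_scale=7.5)
```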
Limitations
Despite its effectiveness, Stable Diffusion 1.1 demonstrates several noteworthy limitations:
Bias and Representation: The model reflects biases present in the web-scraped LAION-5B dataset, particularly toward Western and English-centric content. This can impact fairness and diversity in generated outputs (model card discussion).
Photorealism and Composition: While capable of generating convincing images, the model can underperform on tasks requiring perfect photorealism or complex compositional arrangements (e.g., spatial reasoning described in text). Rendering textual content within images or fine-grained details, especially for faces, remains challenging.
Resolution Constraints: Image quality may diminish below 512×512 resolution or when simultaneously increasing both width and height beyond this baseline, leading to artifacts and loss of global image coherence (technical documentation).
Memorization and Repetition: No deduplication methods were applied in dataset curation, leading to potential memorization of common or repeated images (Hugging Face model card).
Language Dependency: The model is most effective for prompts in English; generation quality degrades in other languages.
Fine-tuning Needs: Customizing the model for niche domains or styles typically requires computationally expensive fine-tuning.
Watermarking Robustness: Outputs include invisible watermarking for machine-generated image identification, but rescaling or rotating images can reduce its efficacy (invisible-watermark repository).
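The watermark is applied with the invisible-watermark package. A sketch of the encoding step, following the pattern used by the v1 reference scripts (the payload string and file names here are illustrative), might look like this:

```python
import cv2
import numpy as np
from PIL import Image
from imwatermark import WatermarkEncoder

# Embed an invisible watermark using the dwtDct method.
wm_text = "StableDiffusionV1"
encoder = WatermarkEncoder()
encoder.set_watermark("bytes", wm_text.encode("utf-8"))

pil_img = Image.open("generated.png")                      # hypothetical generated image
bgr = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)   # the library operates on BGR arrays
bgr_marked = encoder.encode(bgr, "dwtDct")
Image.fromarray(cv2.cvtColor(bgr_marked, cv2.COLOR_BGR2RGB)).save("generated_wm.png")
```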
Licensing
Stable Diffusion 1.1 is distributed under the CreativeML OpenRAIL M license, which governs responsible use and redistribution (license terms). The license permits both research and commercial applications, with explicit restrictions to prevent generation or dissemination of illegal, harmful, or offensive content, as well as guidance for responsible model sharing and output use.
Release Timeline and Model Family
Stable Diffusion v1.1 was released in August 2022. It is part of a broader family of v1 checkpoints, including v1.2, v1.3, and v1.4, each representing further training and refinement on curated datasets with adjusted aesthetic filters and sampling strategies (CompVis at Hugging Face). Subsequent versions, such as Stable Diffusion 1.5, build progressively on the v1.1 foundation by extending training duration or integrating data filtered for higher aesthetic quality, offering alternative trade-offs in output characteristics (dataset details).
Example Outputs
Outputs from Stable Diffusion 1.1 are reproducible for a fixed random seed and sampling configuration, and vary when the seed or sampling parameters change.
"A photograph of an astronaut riding a horse"—sample output generated under default settings.