Stable Diffusion 2 is a major evolution in text-to-image generation, released by Stability AI on November 24, 2022. The model family builds on its predecessor with substantial improvements in image quality and capabilities.
Development was led by Stability AI, with the base model serving as the foundation for several specialized variants. Training drew on the LAION-5B dataset, with explicit content filtered out by LAION's NSFW detector using conservative thresholds.
Training was conducted on 256 A100 GPUs (32 nodes of 8 GPUs each) using the AdamW optimizer. This computational investment carried an estimated environmental cost of 15,000 kg CO2 eq., as calculated with the Machine Learning Impact calculator.
At its core, Stable Diffusion 2 is a Latent Diffusion Model distinguished by its use of a pretrained OpenCLIP-ViT/H text encoder, a notable departure from the frozen CLIP ViT-L/14 encoder of the previous version. This architectural change, developed in collaboration with LAION, yields markedly improved image fidelity and generation capabilities.
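The practical payoff of the latent-diffusion design can be seen with a quick back-of-envelope calculation. In Stable Diffusion, the VAE downsamples images by a factor of 8 into 4-channel latents, so the diffusion process operates over far fewer elements than it would in pixel space. The sketch below assumes those standard factors; the helper names are illustrative, not part of any library:

```python
# Back-of-envelope sketch of latent-diffusion dimensionality.
# Assumes the standard SD factors: 8x spatial downsampling, 4 latent channels.

def latent_shape(height: int, width: int,
                 channels: int = 4, downsample: int = 8) -> tuple:
    """Return the (channels, height, width) latent shape for an image."""
    assert height % downsample == 0 and width % downsample == 0
    return (channels, height // downsample, width // downsample)

def compression_ratio(height: int, width: int) -> float:
    """Ratio of pixel-space elements (3 RGB channels) to latent elements."""
    c, h, w = latent_shape(height, width)
    return (3 * height * width) / (c * h * w)

print(latent_shape(768, 768))       # (4, 96, 96)
print(compression_ratio(768, 768))  # 48.0
```

A 768x768 image is thus denoised as a 4x96x96 latent, roughly 48 times fewer elements than the raw pixels, which is what makes generation at these resolutions tractable on consumer GPUs.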
The Stable Diffusion 2 family encompasses several specialized variants that share the same fundamental architecture. The base model was trained on a filtered subset of LAION-5B; the enhanced version resumed from it for 150,000 steps using a v-objective, followed by an additional 140,000 steps on 768x768 images.
The family includes the base 512x512 model, the upgraded 768x768 version, a depth-aware variant incorporating MiDaS depth prediction, a specialized inpainting model, and a text-guided latent upscaling model, each adding capabilities for a specific use case on top of the shared core architecture.
The Stable Diffusion 2 family handles a wide range of image generation tasks. The models generate images at default resolutions of 512x512 and 768x768 pixels, and can reach 2048x2048 or beyond when combined with the included super-resolution upscaler.
A notable advancement in the family is the depth-guided model (depth2img), which uses MiDaS depth estimation to generate new images that preserve the structure of an input image. This capability has proven particularly valuable for architectural and design applications.
The model family's impact on the AI community is evidenced by its rapid adoption and strong activity on GitHub.
Implementation of Stable Diffusion 2 models is primarily achieved through the diffusers library, which requires additional dependencies including transformers, accelerate, scipy, and safetensors. For memory-efficient attention, the xformers library is recommended, while pipe.enable_attention_slicing() can be used to reduce VRAM usage at the cost of processing speed.
The model family's technical foundation is detailed in the Latent Diffusion Models Paper, while implementation details can be found in the Stable Diffusion GitHub Repository.
Despite its advanced capabilities, the Stable Diffusion 2 family has several acknowledged limitations: challenges with photorealistic rendering, difficulty generating legible text, and struggles with complex compositional tasks. The models also have known weaknesses in face and human generation, and reflect biases from their primarily English-language training data.
The environmental impact of training these models is significant, though efforts have been made to optimize the training process and reduce computational requirements where possible. The models are released under the CreativeML Open RAIL++-M License, promoting responsible AI development while maintaining accessibility for researchers and developers.
The Stable Diffusion 2 family continues to evolve through community contributions and official updates. The open-source nature of the project, combined with its modular architecture, allows for ongoing improvements and specialized applications. Future developments are expected to address current limitations while expanding the model family's capabilities in areas such as multi-lingual support and enhanced photorealism.
For implementation details and updates, developers can refer to the Hugging Face Diffusers Library, which serves as the primary framework for deploying these models in practical applications.