Note: Stable Diffusion 3.5 Turbo weights are released under the Stability AI Community License, which permits free non-commercial use and commercial use by organizations with under $1 million in annual revenue. Please read the license to verify whether your use case is permitted.
Model Report
stabilityai / Stable Diffusion 3.5 Turbo
Stable Diffusion 3.5 Turbo is a text-to-image generation model developed by Stability AI using a Multimodal Diffusion Transformer architecture with three text encoders and Query-Key Normalization. The model employs Adversarial Diffusion Distillation to enable high-quality image synthesis in as few as four denoising steps, providing efficient inference while maintaining prompt adherence and visual fidelity across diverse artistic styles.
Stable Diffusion 3.5 Turbo is a generative image synthesis model developed by Stability AI, part of the broader Stable Diffusion 3.5 model family. Released on October 22, 2024, the Turbo variant emphasizes fast image generation while retaining high image fidelity and prompt alignment. It evolves the earlier Stable Diffusion architectures, pairing a transformer-based diffusion backbone with distillation-based training to generate diverse, highly detailed visual content from text prompts.
A photorealistic example generated by Stable Diffusion 3.5, showcasing the model's ability to synthesize complex visual scenes from text prompts.
At the technical core of Stable Diffusion 3.5 Turbo lies the Multimodal Diffusion Transformer (MMDiT) framework. This architecture combines the strengths of transformer models with diffusion-based image synthesis, enabling efficient and high-fidelity generation. The MMDiT system integrates three frozen, pretrained text encoders, OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL, whose embeddings capture prompt semantics at different levels of granularity.
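The outputs of the three encoders are merged into a single conditioning sequence before entering the transformer. Below is a simplified numpy sketch of how SD3-style models combine them; the function name, and the exact padding details, are illustrative assumptions rather than the reference implementation:

```python
import numpy as np

def combine_text_embeddings(clip_l, clip_g, t5):
    """Merge per-token embeddings from the three text encoders.

    clip_l: (seq, 768)   CLIP-ViT/L token embeddings
    clip_g: (seq, 1280)  OpenCLIP-ViT/G token embeddings
    t5:     (seq_t5, 4096) T5-XXL token embeddings
    """
    # The two CLIP streams are concatenated on the channel axis...
    clip = np.concatenate([clip_l, clip_g], axis=-1)            # (seq, 2048)
    # ...zero-padded up to the T5 channel width...
    pad = np.zeros((clip.shape[0], t5.shape[-1] - clip.shape[-1]))
    clip = np.concatenate([clip, pad], axis=-1)                 # (seq, 4096)
    # ...and joined with the T5 tokens along the sequence axis.
    return np.concatenate([clip, t5], axis=0)                   # (seq + seq_t5, 4096)
```

The padding step exists because the CLIP channel width (2048 combined) is narrower than T5-XXL's 4096, so the two streams must match before sequence-wise concatenation.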
One of the distinctive features of this model is the use of Query-Key Normalization (QK-normalization) in its transformer blocks. Normalizing the query and key projections bounds the magnitude of the attention logits, which stabilizes training and simplifies subsequent fine-tuning. To accelerate image generation, Stable Diffusion 3.5 Turbo employs Adversarial Diffusion Distillation (ADD), a technique that enables high-quality generation in a small number of sampling steps, typically as few as four.
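The effect of QK-normalization can be illustrated with a minimal single-head attention sketch in numpy. RMS normalization is used here as an assumed choice of norm; the point is only that queries and keys are rescaled before the dot product, keeping logits bounded:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Rescale each vector to unit root-mean-square over the head dimension.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    # QK-normalization: normalize queries and keys before computing
    # attention scores, so logits cannot blow up as activations grow.
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over each row of scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Without the normalization step, the logits scale with the raw magnitudes of q and k, which is one source of instability when training large transformers in reduced precision.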
Diagram of the MMDiT architecture, illustrating the integration of multiple text encoders and the internal structure of a single transformer block.
Stable Diffusion 3.5 Turbo is engineered for rapid image generation, providing efficient inference without significant loss of prompt adherence or visual quality. ADD allows image synthesis in as few as four denoising steps, lowering generation latency and computational requirements relative to comparable non-distilled models. The model also includes architectural optimizations that make it feasible to run on a range of consumer hardware, as demonstrated by benchmarking data that place Stable Diffusion 3.5 Medium, a related variant, at only 9.9 GB of VRAM usage (excluding text encoders).
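As a usage sketch, the four-step setting maps directly onto inference parameters in the Hugging Face diffusers library. The snippet assumes the stabilityai/stable-diffusion-3.5-large-turbo checkpoint and wraps the Turbo settings in a small helper of my own naming; the actual generation call is gated behind an environment variable because it requires a GPU and the downloaded weights:

```python
import os

def turbo_sampling_kwargs(steps: int = 4) -> dict:
    # ADD-distilled models are typically run without classifier-free
    # guidance, so guidance_scale is 0.0 alongside the low step count.
    return {"num_inference_steps": steps, "guidance_scale": 0.0}

if os.environ.get("RUN_SD35_TURBO"):
    # Requires torch, diffusers, a CUDA GPU with sufficient VRAM,
    # and access to the model weights on the Hugging Face Hub.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large-turbo",
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    image = pipe(
        "a red fox resting in fresh snow, golden hour",
        **turbo_sampling_kwargs(),
    ).images[0]
    image.save("fox.png")
```

On GPUs with limited VRAM, `pipe.enable_model_cpu_offload()` can be substituted for the `.to("cuda")` call to trade speed for memory.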
VRAM requirements for Stable Diffusion 3.5 Medium and comparable models across various GPU types, illustrating efficiency and compatibility.
Empirical evaluation shows that the Turbo variant remains competitive in both output quality and prompt-alignment metrics. Evaluations against a field of open-source models placed Stable Diffusion 3.5 Large, the foundational architecture for Turbo, among the leaders in prompt adherence and aesthetic quality when assessed with human and automated scoring.
Performance comparison showing prompt adherence and aesthetic quality scores for Stable Diffusion 3.5 family and other models.
Central to the Turbo variant is its ability to consistently generate diverse and realistic outputs, even with minimal explicit instruction in prompts. Dedicated benchmarking scenarios have demonstrated the model’s capacity to produce a wide spectrum of skin tones, facial features, and aesthetic variations, supporting broader representation and creative flexibility.
Triptych of generated portraits showcasing the model's capacity to represent diverse skin tones and features with minimal prompting.
Stylistic versatility is also a defining feature. Stable Diffusion 3.5 Turbo can execute a spectrum of visual genres, from photorealistic rendering and digital painting to stylized illustrations and anime-inspired art. This breadth allows creators and researchers to tailor their outputs for a variety of applications.
Demonstration of varied styles generated by the model: etched illustration, photography, and anime-inspired digital painting.
In addition to classical and photographic styles, the model is capable of incorporating nuanced typography and handling complex, multi-faceted prompts with improved comprehension compared to prior generations.
Training Data and Related Models
Stable Diffusion 3.5 Turbo was trained on a blend of curated publicly available datasets and synthetic image-text pairs, a strategy that supports diversity, generalization, and content safety. The broader Stable Diffusion 3.5 family includes other notable variants: Stable Diffusion 3.5 Large, an 8.1-billion-parameter model suited to high-resolution professional use, and Stable Diffusion 3.5 Medium, a 2.5-billion-parameter version focused on lower VRAM requirements and rapid customization.
Three example outputs, each generated from a different text prompt, illustrating the model's capacity for prompt-to-image synthesis.
While Stable Diffusion 3.5 Turbo advances generative image capabilities, certain limitations are inherent to its design and training process. Output diversity is prioritized, which can introduce variability in results for similar prompts, especially in the absence of detailed guidance. The model is intended for creative generation rather than factual representation, and its outputs are not designed to be accurate depictions of real events or individuals. Datasets have been filtered to reduce the presence of potentially harmful or inappropriate content, but comprehensive content safety is not guaranteed, and users are encouraged to implement their own safeguards.
The model's intended use includes artistic synthesis, educational and creative tool development, and academic research into generative methods. Developers and deployers are urged to follow privacy regulations and ethical standards, particularly regarding user data and generated content.
Licensing and Accessibility
All Stable Diffusion 3.5 models, including Turbo, are provided under the Stability AI Community License. This permissive license enables free use for non-commercial purposes, including research and education, and for commercial purposes by organizations with annual revenues up to $1 million. Generated media remain the property of their creators, without restrictive downstream licensing. Uses must comply with Stability AI's Acceptable Use Policy.