Model Report
stabilityai / ControlNet SDXL Depth
ControlNet SDXL Depth is a conditional control model that enables depth map-guided image generation using the Stable Diffusion XL framework. The model processes depth information from various sources including MiDaS, Leres, and Zoe Depth methods to constrain image synthesis according to spatial relationships and three-dimensional scene structure, allowing users to generate images that adhere to specific geometric arrangements while maintaining the base diffusion model's generative capabilities.
The depth-specific variant of ControlNet 1.1 for Stable Diffusion 1.5, formally designated control_v11f1p_sd15_depth, is a generative model that guides Stable Diffusion 1.5 with depth maps. As part of the ControlNet 1.1 release, it preserves the architecture of previous ControlNet models while introducing improvements in robustness and quality that yield more accurate image synthesis conditioned on depth estimates. By integrating depth information, the model extends the diffusion process to produce images that adhere closely to the spatial constraints and three-dimensional cues embedded in an input depth map.
Visualization of depth map-based image generation using a grayscale depth map and resulting portrait outputs. Prompt: Women’s portrait synthesis with grayscale depth control.
This model is grounded in the ControlNet 1.1 architecture, closely following the network design of ControlNet 1.0. The model functions as a conditional control system layered upon the Stable Diffusion framework, providing additional input channels that align the generative process with external structural cues, particularly those encoded in depth maps.
Depth maps, which may originate from monocular depth estimation algorithms or real-world 3D renderers, are preprocessed and fed into the model as conditioning input. This depth-aware conditioning enables the diffusion process to respect the foreground-background relationships, occlusion order, and relative distances expressed in the map. It is achieved without architectural divergence from earlier ControlNet versions; the developers deliberately deferred major changes to future releases to preserve stability and reproducibility.
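As an illustration of this preprocessing step, the sketch below derives a grayscale depth map from an RGB image using a MiDaS-family estimator exposed through the Hugging Face transformers depth-estimation pipeline. The Intel/dpt-large checkpoint, file names, and 512-pixel target resolution are illustrative assumptions rather than requirements of the model.

```python
# Minimal sketch: derive a grayscale depth map from an RGB image for use as
# a ControlNet conditioning input. "Intel/dpt-large" is one MiDaS-family
# estimator; any monocular depth model producing a single-channel map could
# be substituted.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = Image.open("input.jpg").convert("RGB")
result = depth_estimator(image)           # dict with "depth" (PIL image) and "predicted_depth" (tensor)
depth_map = result["depth"].convert("L")  # single-channel depth map

# The conditioning image is typically resized to the generation resolution
# and expanded to three channels before being passed to the pipeline.
depth_map = depth_map.resize((512, 512)).convert("RGB")
depth_map.save("depth_control.png")
```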
ControlNet's codebase is primarily implemented in Python, facilitating modifiability and integration within established diffusion pipelines.
Training Data and Methodology
The depth-specific variant of ControlNet 1.1 was trained on a multi-source dataset that combines depth maps produced by the MiDaS, Leres, and Zoe Depth methods. Training incorporated data augmentation, including random left-right flipping, and used depth maps at multiple input resolutions (256, 384, and 512 pixels) to bolster generalization across variations in scale and source.
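The following sketch, which is not the authors' training code, illustrates the paired augmentation described above: a shared random left-right flip and a randomly selected training resolution applied identically to an image and its depth map.

```python
import random
from PIL import Image

# Illustrative sketch of the paired augmentation: flip decisions and the
# chosen resolution must be shared between the RGB image and its depth map
# so the spatial correspondence is preserved.
RESOLUTIONS = (256, 384, 512)

def augment_pair(image: Image.Image, depth: Image.Image) -> tuple[Image.Image, Image.Image]:
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
        depth = depth.transpose(Image.FLIP_LEFT_RIGHT)
    size = random.choice(RESOLUTIONS)
    return image.resize((size, size)), depth.resize((size, size))
```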
Key improvements over prior models address several data quality challenges. Corrections to the training set eliminated duplicated grayscale images and low-quality samples, while also refining the correspondence between images and textual prompts. These enhancements yielded a model that is not tuned to any singular depth estimation method, resulting in robust performance even when driven by depth maps from novel sources or varying preprocessing pipelines.
This model principally serves to guide the image generation capabilities of Stable Diffusion 1.5 according to the constraints imposed by input depth maps. The resulting system is able to generate images consistent with the geometric arrangement and spatial relationships of objects specified by the depth input. This is particularly useful in tasks requiring fidelity to three-dimensional scene layout, such as the generation of photorealistic portraits, interior renderings, and arbitrary scene synthesis from sketched or programmatically derived depth cues.
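A minimal inference sketch using the diffusers library is shown below. The base-model repository ID, scheduler choice, prompt, and sampling settings are assumptions for illustration; the ControlNet checkpoint name follows the designation given above.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# Sketch of depth-conditioned generation; repository IDs for the SD 1.5 base
# model and the depth ControlNet are illustrative assumptions.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

depth_map = load_image("depth_control.png")  # preprocessed depth map, as above
result = pipe(
    "a photorealistic portrait, soft studio lighting",
    image=depth_map,
    num_inference_steps=30,
).images[0]
result.save("portrait.png")
```

The same pattern extends to other ControlNet 1.1 modalities by swapping the ControlNet checkpoint and supplying the corresponding conditioning image.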
Depth-based control is one modality among several in the broader ControlNet 1.1 family of models, which encompasses additional control types including edge, normal map, scribble, line art, soft edge, segmentation, pose (OpenPose), inpainting, and stylization controls.
Resource-efficient deployment is addressed by Control-LoRA variants, which employ low-rank adaptation to reduce checkpoint size from approximately 4.7 GB to as little as 377 MB. This makes depth, edge, and stylization controls accessible on consumer hardware with a reduced computational footprint. The LoRA-based models retain the core functionality of their larger ControlNet counterparts, offering similar control while broadening accessibility.
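The back-of-the-envelope sketch below illustrates why a low-rank factorization stores far fewer values than a full weight update; the layer dimensions and rank are hypothetical and do not reflect the actual Control-LoRA configuration.

```python
# A rank-r update stores d_out*r + r*d_in values instead of d_out*d_in for a
# full weight matrix. Dimensions and rank here are illustrative only.
def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    return d_out * rank + rank * d_in

d_out, d_in, rank = 1280, 1280, 128
print(full_params(d_out, d_in))        # 1,638,400 values for a full update
print(lora_params(d_out, d_in, rank))  # 327,680 values at rank 128 (~5x smaller)
```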
Limitations and Considerations
While this model generally provides robust depth conditioning, certain modalities in the ControlNet family—such as the experimental instruct-based ip2p and the stylization-oriented shuffle—may require user discretion and iterative refinement for optimal results. Additionally, some specialized models like the anime-focused line art variant (control_v11p_sd15s2_lineart_anime.pth) are contingent on specific model checkpoints and do not support all operational modes present in other controls.
The license governing this model and its family of models has not been explicitly detailed in the available documentation.
External Resources
For further technical insights, code, and downloads: