Model Report
lllyasviel / ControlNet SD 1.5 Depth
ControlNet SD 1.5 Depth is a conditional image generation model that guides Stable Diffusion synthesis using grayscale depth maps as spatial control inputs. Developed by lllyasviel as part of the ControlNet 1.1 family, it enables geometry-aware image creation by constraining output structure according to provided depth information, with applications in 3D-aware content generation and photorealistic rendering workflows.
ControlNet SD 1.5 Depth is a conditional generative model designed to guide image synthesis with Stable Diffusion through the use of grayscale depth maps. As part of the ControlNet 1.1 family, it maintains compatibility with earlier versions while offering enhanced robustness, improved result consistency, and support for a wide range of depth-based inputs. The model enables fine-grained spatial control in the image generation process, especially when integrating 3D information, making it broadly applicable for fields such as graphics, content creation, and 3D rendering workflows.
Non-cherry-picked outputs from ControlNet SD 1.5 Depth: generated variations of a man's portrait, each reflecting pose and structure controlled by an input depth map and the prompt 'a handsome man'.
The central feature of ControlNet SD 1.5 Depth is its ability to steer image synthesis using externally provided grayscale depth maps. In these maps, lighter regions are interpreted as closer to the camera and darker regions as farther away, and this depth cue constrains the spatial structure of the output, enabling the generation of realistic images that remain faithful to the underlying geometry of the input. The model accepts various formats of real and estimated depth maps, demonstrating robustness across sources including those generated by 3D rendering engines and state-of-the-art monocular depth estimators.
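As a concrete illustration of this workflow, the sketch below estimates a depth map from an ordinary photograph and uses it to condition Stable Diffusion 1.5. It assumes the Hugging Face diffusers and transformers libraries and the public control_v11f1p_sd15_depth checkpoint; none of these is the only way to run the model, and the repository identifiers, file names, and step counts are indicative rather than prescriptive.

```python
# Minimal sketch: depth-conditioned generation with diffusers (assumed setup).
import torch
from PIL import Image
from transformers import pipeline
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler

# 1. Estimate a grayscale depth map from an RGB photo.
#    Lighter pixels are treated as closer to the camera, darker as farther away.
depth_estimator = pipeline("depth-estimation")            # DPT/MiDaS-style default model
source = Image.open("portrait.png").convert("RGB")        # placeholder input image
depth_map = depth_estimator(source)["depth"].convert("RGB")

# 2. Load the depth ControlNet and attach it to a Stable Diffusion 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",                      # any SD 1.5 checkpoint works
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# 3. Generate: the prompt sets appearance, the depth map constrains geometry.
result = pipe("a handsome man", image=depth_map, num_inference_steps=25).images[0]
result.save("depth_conditioned.png")
```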
To enhance performance and generalizability, ControlNet SD 1.5 Depth employs extensive data augmentation strategies in training. This includes random mirroring, resolution-matched datasets, and diverse depth map generators to avoid overfitting to a single pre-processing approach. This methodology enables the model to work effectively even when high-quality, real-world depth data is unavailable or inconsistent, as detailed in the official ControlNet documentation.
A notable variant is Control-LoRA, which utilizes Low-Rank Adaptation techniques to reduce the storage size of the depth model, thereby compressing the model from 4.7GB to as little as 377MB, according to the Stable Diffusion Control-LoRA documentation.
Control-LoRA output demonstration: two grayscale depth maps (blurred and clear) guide the generation of multiple portrait variations, illustrating robust depth-conditioned synthesis.
ControlNet SD 1.5 Depth is architecturally aligned with the broader ControlNet 1.1 suite, employing the same foundational neural network design as ControlNet 1.0. Its distinctive aspect is the integration of depth maps as conditional inputs alongside conventional Stable Diffusion prompts, achieved by injecting depth-derived features into the denoising UNet at multiple resolution levels of the synthesis pipeline. In certain ControlNet 1.1 models (such as Shuffle, and potentially others), a global average pooling layer is additionally introduced between the ControlNet encoder and the UNet stages, as elaborated in the technical guidance.
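To make the injection mechanism more tangible, the following is a highly simplified PyTorch sketch of the general ControlNet pattern: a trainable copy of the UNet encoder processes the depth hint, and its outputs pass through zero-initialized convolutions before being added back into the frozen UNet. All class and function names here are schematic assumptions, not the actual ControlNet or diffusers implementation.

```python
# Conceptual sketch of ControlNet-style depth injection (illustrative names only).
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so the control branch starts with
    no influence on the frozen UNet and learns its contribution gradually."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class DepthControlBranch(nn.Module):
    def __init__(self, unet_encoder_blocks: nn.ModuleList, block_channels: list[int]):
        super().__init__()
        # Trainable copy of the UNet encoder; the original UNet stays frozen.
        self.control_encoder = copy.deepcopy(unet_encoder_blocks)
        # One zero convolution per level where control features are injected.
        self.zero_convs = nn.ModuleList([zero_conv(c) for c in block_channels])

    def forward(self, latent: torch.Tensor, depth_hint: torch.Tensor) -> list[torch.Tensor]:
        # The depth map, embedded to latent resolution, is added to the input of
        # the control branch; each level emits a residual that is later summed
        # with the corresponding feature map inside the frozen UNet decoder.
        h = latent + depth_hint
        residuals = []
        for block, zc in zip(self.control_encoder, self.zero_convs):
            h = block(h)
            residuals.append(zc(h))
        return residuals  # injected into the UNet at matching resolution levels
```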
The standard model (control_v11f1p_sd15_depth.pth) was trained on a multi-resolution combination of depth labels produced by several monocular estimation methods (MiDaS, LeReS, and ZoeDepth) across resolutions from 256 to 512 pixels. This heterogeneity in the training data serves as augmentation and regularization, bolstering the model's ability to generalize to unseen depth map distributions. Improvements over ControlNet 1.0 include the removal of biases from earlier datasets, such as duplicated images, incorrect pairings, and low-quality data, to ensure that the model does not overfit to the artifacts of any single depth estimation system.
Fine-tuning for the Control-LoRA variant involved initial training on depth maps produced by the MiDaS estimator, complemented by further optimization using the ClipDrop Portrait Depth Estimation method, as described in the corresponding ClipDrop documentation.
Applications and Use Cases
ControlNet SD 1.5 Depth is applied wherever conditional, geometry-aware image synthesis is required. The principal scenario involves generating novel images with Stable Diffusion while preserving spatial configuration dictated by a supplied depth map, enabling applications such as photo-realistic rendering, character generation with precise 3D-aware poses, and content manipulation where depth continuity is essential.
The model's robustness allows its deployment in pipelines that incorporate real depth information directly from 3D graphics engines. This enables workflows where digital assets created in 3D can be transformed into photorealistic 2D images or stylized interpretations, all while maintaining accurate structural correspondences thanks to the use of depth. Beyond this, the model is frequently utilized in creative settings, virtual production, and prototyping within fields like gaming, cinematography, and virtual reality, as described in the ControlNet GitHub documentation.
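When the depth comes from a renderer rather than an estimator, the raw buffer usually needs to be remapped to the convention this model expects (near surfaces light, far surfaces dark). The snippet below is a minimal sketch of that normalization, assuming a linear per-pixel depth buffer saved as a NumPy array; file names and value ranges are placeholders for whatever your 3D engine emits.

```python
# Sketch: convert a linear depth buffer from a 3D engine into a ControlNet-style
# grayscale depth map (nearer = lighter, farther = darker). Paths are placeholders.
import numpy as np
from PIL import Image

depth = np.load("render_depth.npy").astype(np.float32)   # per-pixel distance from camera (H x W)

# Normalize to [0, 1] and invert so that near = bright, far = dark.
d_min, d_max = float(depth.min()), float(depth.max())
normalized = (depth - d_min) / max(d_max - d_min, 1e-8)
control_map = 1.0 - normalized

# Save as an 8-bit grayscale image usable as the conditioning input.
Image.fromarray((control_map * 255).astype(np.uint8), mode="L").save("control_depth.png")
```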
Limitations and Considerations
Several caveats should be considered when employing ControlNet SD 1.5 Depth. Early in the 1.1 release, an unconverged intermediate checkpoint (control_v11p_sd15_depth) was briefly made available and could produce distorted or suboptimal outputs; this issue was rectified with the bug-fixed checkpoint (control_v11f1p_sd15_depth). For storage-constrained deployments, the Control-LoRA variant offers efficiency improvements over the full model.
Although depth-based augmentation has improved performance, the relative benefit may be limited when employing high-quality depth maps at mid-range resolutions, particularly those generated by the MiDaS pipeline, according to commentary in the release notes. Compatibility is maintained with the Stable Diffusion 1.5 backbone and a variety of other ControlNet modules via extensible interfaces, but optimal utility is contingent on high-quality, well-aligned depth input.
Model Family and Related Variants
ControlNet 1.1 encompasses a diverse family of models each tailored to different forms of conditional control, such as edge maps, normal maps, pose estimation, and line art, all working seamlessly within Stable Diffusion's architecture. The depth variant distinguishes itself through its reliance on spatial structure provided by depth estimation, while other models, including those for canny edges, segmentation, and normal maps, approach image conditioning from alternative representation perspectives.
The ControlNet architecture supports simultaneous use of multiple conditioning types, allowing depth maps to be combined with other controls for more sophisticated image synthesis pipelines, as detailed in the multi-ControlNet documentation.
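As one hedged example of such a pipeline, the sketch below uses diffusers' multi-ControlNet support to pair the depth model with a Canny-edge ControlNet. The checkpoint names, conditioning images, and per-control scales are assumptions to be adapted to your own assets.

```python
# Sketch: combining depth and Canny-edge ControlNets in one generation pass.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

depth_cn = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
canny_cn = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[depth_cn, canny_cn],             # multiple controls applied together
    torch_dtype=torch.float16,
).to("cuda")

depth_map = load_image("control_depth.png")      # placeholder conditioning images
edge_map = load_image("control_canny.png")

image = pipe(
    "a handsome man, studio lighting",
    image=[depth_map, edge_map],                 # one conditioning image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.7],    # per-control strength
    num_inference_steps=25,
).images[0]
image.save("multi_control.png")
```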
Release History and Licensing
ControlNet SD 1.5 Depth was introduced as part of the ControlNet 1.1 series, with significant updates and bug fixes finalized in April 2023. The official model is continually maintained and updated in the ControlNet GitHub repository, which also hosts ongoing development and community support.
The licensing of the ControlNet-v1-1-nightly release and its models specifies use for research and academic experiments. No explicit open-source license is detailed, and users are advised to consult the repository for current usage policies.