<!-- Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved. Copyright 2024-2025 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

# Marigold Computer Vision

Marigold was proposed in Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation, a CVPR 2024 Oral paper by Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. The core idea is to repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks. This approach was explored by fine-tuning Stable Diffusion for Monocular Depth Estimation, as demonstrated in the teaser above.

Marigold was later extended in the follow-up paper, Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis, authored by Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Konrad Schindler. This work expanded Marigold to support new modalities such as Surface Normals and Intrinsic Image Decomposition (IID), introduced a training protocol for Latent Consistency Models (LCM), and demonstrated High-Resolution (HR) processing capability.

> [!TIP]
> The early Marigold models (v1-0 and earlier) were optimized for best results with at least 10 inference steps. LCM models were later developed to enable high-quality inference in just 1 to 4 steps. Marigold models v1-1 and later use the DDIM scheduler to achieve optimal results in as few as 1 to 4 steps.

## Available Pipelines

Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a corresponding prediction. Currently, the following computer vision tasks are implemented:

| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
|---|---|---|---|
| `MarigoldDepthPipeline` | prs-eth/marigold-depth-v1-1 | Depth Estimation | Depth, Disparity |
| `MarigoldNormalsPipeline` | prs-eth/marigold-normals-v1-1 | Surface Normals Estimation | Surface normals |
| `MarigoldIntrinsicsPipeline` | prs-eth/marigold-iid-appearance-v1-1, prs-eth/marigold-iid-lighting-v1-1 | Intrinsic Image Decomposition | Albedo, Materials, Lighting |
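Surface normals are predicted as unit-length 3D vectors with components in the range -1 to 1, and visualization helpers such as `visualize_normals` render such vectors as RGB images. A minimal pure-Python sketch of the widely used `(n + 1) / 2` rescaling convention (an illustration of the convention only, not the library's actual implementation; `normal_to_rgb` is a hypothetical helper name):

```python
def normal_to_rgb(n):
    """Map a unit-length normal with components in [-1, 1] to 8-bit RGB
    using the common (n + 1) / 2 rescaling convention."""
    return tuple(round((c + 1.0) / 2.0 * 255) for c in n)

# A normal pointing along +Z maps to the characteristic bluish color
# seen in normal-map visualizations.
rgb = normal_to_rgb((0.0, 0.0, 1.0))  # -> (128, 128, 255)
```

Per-component rescaling like this is why normal-map visualizations have their distinctive purple-blue appearance: the dominant +Z component saturates the blue channel.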

## Available Checkpoints

All original checkpoints are available under the PRS-ETH organization on Hugging Face. They are designed for use with diffusers pipelines and the original codebase, which can also be used to train new model checkpoints. The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.

| Checkpoint | Modality | Comment |
|---|---|---|
| prs-eth/marigold-depth-v1-1 | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
| prs-eth/marigold-normals-v1-1 | Normals | Surface normals predictions are unit-length 3D vectors in the screen-space camera frame, with values in the range from -1 to 1. |
| prs-eth/marigold-iid-appearance-v1-1 | Intrinsics | InteriorVerse decomposition comprises Albedo and two BRDF material properties: Roughness and Metallicity. |
| prs-eth/marigold-iid-lighting-v1-1 | Intrinsics | HyperSim decomposition of an image $I$ comprises Albedo $A$, Diffuse shading $S$, and Non-diffuse residual $R$: $I = A \cdot S + R$. |
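Because affine-invariant depth is defined only up to an unknown scale and shift, a prediction can be aligned to metric depth with a least-squares fit whenever a few metric measurements are available. A pure-Python sketch of that alignment (an illustration of the affine-invariant property, not part of the pipeline API; `align_affine_invariant_depth` is a hypothetical helper name):

```python
def align_affine_invariant_depth(pred, metric):
    """Fit scale s and shift t minimizing sum((s * p + t - z)^2) over
    paired affine-invariant predictions p and metric measurements z."""
    n = len(pred)
    sd = sum(pred)
    sz = sum(metric)
    sdd = sum(p * p for p in pred)
    sdz = sum(p * z for p, z in zip(pred, metric))
    s = (n * sdz - sd * sz) / (n * sdd - sd * sd)
    t = (sz - s * sd) / n
    return s, t

# Example: predictions in [0, 1] that are an affine transform of true depth.
true_depth = [1.0, 2.0, 3.0, 4.0]             # meters
pred = [(z - 1.0) / 3.0 for z in true_depth]  # normalized to [0, 1]
s, t = align_affine_invariant_depth(pred, true_depth)
# s ≈ 3.0, t ≈ 1.0, so metric depth ≈ s * prediction + t
```

In practice the sparse metric measurements could come from a LiDAR sensor or a calibration target; the closed-form fit above is the standard two-parameter least-squares solution.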

> [!TIP]
> Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines. To learn more about reducing the memory usage of this pipeline, refer to the Reduce memory usage section.

> [!WARNING]
> Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint. The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases. To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the API reference). Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration file (`model_index.json`). This ensures high-quality predictions when invoking the pipeline with only the `image` argument.
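The fallback described above can be paraphrased as follows (a simplified sketch of the documented behavior, not the pipeline's actual code; the value 4 for `default_denoising_steps` is only an example):

```python
def resolve_num_inference_steps(num_inference_steps, default_denoising_steps):
    """Mirror the documented fallback: an explicit argument wins,
    otherwise the checkpoint's default_denoising_steps is used."""
    if num_inference_steps is not None:
        return num_inference_steps
    return default_denoising_steps

# If the checkpoint's model_index.json sets default_denoising_steps to 4:
resolve_num_inference_steps(None, 4)  # -> 4 (checkpoint default)
resolve_num_inference_steps(10, 4)    # -> 10 (explicit override)
```

This is why calling a pipeline with only an image already produces high-quality predictions: the checkpoint ships a sensible step count for its own scheduler.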

See also Marigold usage examples.

## Marigold Depth Prediction API

[[autodoc]] MarigoldDepthPipeline
	- __call__

[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput

[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth

## Marigold Normals Estimation API

[[autodoc]] MarigoldNormalsPipeline
	- __call__

[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput

[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals

## Marigold Intrinsic Image Decomposition API

[[autodoc]] MarigoldIntrinsicsPipeline
	- __call__

[[autodoc]] pipelines.marigold.pipeline_marigold_intrinsics.MarigoldIntrinsicsOutput

[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics