Modular pipeline conventions and rules

Shared reference for modular pipeline conventions, patterns, and gotchas.

Common modular conventions

When adding a new modular pipeline (or reviewing one), skim src/diffusers/modular_pipelines/qwenimage/, src/diffusers/modular_pipelines/flux2/, src/diffusers/modular_pipelines/wan/, and src/diffusers/modular_pipelines/helios/ first to establish the pattern. Most conventions (file split between encoders.py / before_denoise.py / denoise.py / decoders.py, how expected_components / inputs / intermediate_outputs are declared, the denoise-loop wrapping with LoopSequentialPipelineBlocks, top-level assembly via AutoPipelineBlocks / SequentialPipelineBlocks in modular_blocks_<model>.py, the ModularPipeline subclass shape, the guider-abstracted denoise body, kwargs_type="denoiser_input_fields" plumbing) are easiest to internalize by comparison rather than from a fixed list.

File structure

src/diffusers/modular_pipelines/<model>/
  __init__.py                          # Lazy imports
  modular_pipeline.py                  # Pipeline class (tiny, mostly config)
  encoders.py                          # Text encoder + image/video VAE encoder blocks
  before_denoise.py                    # Pre-denoise setup blocks (timesteps, latent prep, noise)
  denoise.py                           # The denoising loop blocks
  decoders.py                          # VAE decode block
  modular_blocks_<model>.py            # Block assembly (AutoBlocks)

Block types decision tree

Is this a single operation?
  YES -> ModularPipelineBlocks (leaf block)

Does it run multiple blocks in sequence?
  YES -> SequentialPipelineBlocks
    Does it iterate (e.g. chunk loop)?
      YES -> LoopSequentialPipelineBlocks

Does it choose ONE block based on which input is present?
  Is the selection 1:1 with trigger inputs?
    YES -> AutoPipelineBlocks (simple trigger mapping)
    NO  -> ConditionalPipelineBlocks (custom select_block method)
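
The tree above can be read as a small lookup. The helper below is a hypothetical sketch (`pick_block_type` and its flags are not part of diffusers); it only returns the name of the base class the tree suggests.

```python
def pick_block_type(
    single_operation: bool,
    iterates: bool = False,
    selects_one: bool = False,
    one_to_one_triggers: bool = True,
) -> str:
    """Return the base class name suggested by the decision tree (hypothetical helper)."""
    if selects_one:
        # Exactly one sub-block runs, chosen by which inputs are present.
        return "AutoPipelineBlocks" if one_to_one_triggers else "ConditionalPipelineBlocks"
    if single_operation:
        return "ModularPipelineBlocks"
    # Several blocks run in sequence; iteration calls for the loop wrapper.
    return "LoopSequentialPipelineBlocks" if iterates else "SequentialPipelineBlocks"


print(pick_block_type(single_operation=False, iterates=True))
```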

Build order (easiest first)

1. decoders.py -- Takes latents, runs VAE decode, returns images/videos
2. encoders.py -- Takes prompt, returns prompt_embeds. Add image/video VAE encoder if needed
3. before_denoise.py -- Timesteps, latent prep, noise setup. Each logical operation = one block
4. denoise.py -- The hardest. Convert guidance to guider abstraction

Key pattern: Guider abstraction

Original pipeline has guidance baked in:

python

for i, t in enumerate(timesteps):
    noise_pred = self.transformer(latents, prompt_embeds, ...)
    if self.do_classifier_free_guidance:
        noise_uncond = self.transformer(latents, negative_prompt_embeds, ...)
        noise_pred = noise_uncond + scale * (noise_pred - noise_uncond)
    latents = self.scheduler.step(noise_pred, t, latents).prev_sample

Modular pipeline separates concerns:

python

guider_inputs = {
    "encoder_hidden_states": (prompt_embeds, negative_prompt_embeds),
}

for i, t in enumerate(timesteps):
    components.guider.set_state(step=i, num_inference_steps=num_steps, timestep=t)
    guider_state = components.guider.prepare_inputs(guider_inputs)

    for batch in guider_state:
        components.guider.prepare_models(components.transformer)
        cond_kwargs = {k: getattr(batch, k) for k in guider_inputs}
        context_name = getattr(batch, components.guider._identifier_key)
        with components.transformer.cache_context(context_name):
            batch.noise_pred = components.transformer(
                hidden_states=latents, timestep=t,
                return_dict=False, **cond_kwargs, **shared_kwargs,
            )[0]
        components.guider.cleanup_models(components.transformer)

    noise_pred = components.guider(guider_state)[0]
    latents = components.scheduler.step(noise_pred, t, latents, generator=generator)[0]
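
To check that the two loops above compute the same thing, here is a toy sketch with scalar "latents". `ToyCFGGuider`, `ToyBatch`, and `toy_transformer` are hypothetical stand-ins, not the diffusers classes, and the `guidance_scale > 1.0` switch is an assumption made for the example; the real guider API (set_state, prepare_models, cache contexts) is richer.

```python
from dataclasses import dataclass


def toy_transformer(latents: float, encoder_hidden_states: float) -> float:
    # Hypothetical denoiser: any deterministic function works for this comparison.
    return 0.5 * latents + encoder_hidden_states


@dataclass
class ToyBatch:
    name: str  # "cond" or "uncond"
    encoder_hidden_states: float
    noise_pred: float = 0.0


class ToyCFGGuider:
    """Hypothetical guider: owns the CFG decision and the combination formula."""

    def __init__(self, guidance_scale: float):
        self.guidance_scale = guidance_scale

    def prepare_inputs(self, guider_inputs):
        cond, uncond = guider_inputs["encoder_hidden_states"]
        batches = [ToyBatch("cond", cond)]
        if self.guidance_scale > 1.0:  # the guider, not the block, decides whether CFG runs
            batches.append(ToyBatch("uncond", uncond))
        return batches

    def __call__(self, batches):
        preds = {b.name: b.noise_pred for b in batches}
        if "uncond" not in preds:
            return preds["cond"]
        return preds["uncond"] + self.guidance_scale * (preds["cond"] - preds["uncond"])


def baked_in_step(latents, cond, uncond, scale):
    # Guidance logic written directly into the loop body.
    noise_pred = toy_transformer(latents, cond)
    if scale > 1.0:
        noise_uncond = toy_transformer(latents, uncond)
        noise_pred = noise_uncond + scale * (noise_pred - noise_uncond)
    return noise_pred


def guider_step(latents, cond, uncond, guider):
    # The loop body only runs the model once per batch and lets the guider combine.
    batches = guider.prepare_inputs({"encoder_hidden_states": (cond, uncond)})
    for batch in batches:
        batch.noise_pred = toy_transformer(latents, batch.encoder_hidden_states)
    return guider(batches)
```

The block-side code never mentions guidance_scale or a CFG flag, so swapping in a different guider would not require touching the loop body.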

Key pattern: Denoising loop

All models use LoopSequentialPipelineBlocks for the denoising loop (iterating over timesteps):

python

class MyModelDenoiseLoopWrapper(LoopSequentialPipelineBlocks):
    block_classes = [LoopBeforeDenoiser, LoopDenoiser, LoopAfterDenoiser]

Autoregressive video models (e.g. Helios) also use it for an outer chunk loop:

python

class HeliosChunkDenoiseStep(HeliosChunkLoopWrapper):
    block_classes = [
        HeliosChunkHistorySliceStep,
        HeliosChunkNoiseGenStep,
        HeliosChunkSchedulerResetStep,
        HeliosChunkDenoiseInner,
        HeliosChunkUpdateStep,
    ]

Note: sub-blocks inside LoopSequentialPipelineBlocks receive (components, block_state, i, t) for denoise loops or (components, block_state, k) for chunk loops.
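
A minimal sketch of that calling convention, using plain Python stand-ins rather than diffusers classes. `ToyLoopWrapper` and the three sub-block functions are hypothetical, and the "scheduler step" is reduced to a subtraction so the result is easy to follow.

```python
from types import SimpleNamespace


class ToyLoopWrapper:
    """Hypothetical loop wrapper: owns the iteration, delegates the loop body."""

    def __init__(self, sub_blocks):
        self.sub_blocks = sub_blocks

    def __call__(self, components, block_state):
        for i, t in enumerate(block_state.timesteps):
            for block in self.sub_blocks:
                # Every sub-block gets the same (components, block_state, i, t) arguments.
                components, block_state = block(components, block_state, i, t)
        return components, block_state


def loop_before_denoiser(components, block_state, i, t):
    block_state.model_input = block_state.latents  # scaling / concatenation would go here
    return components, block_state


def loop_denoiser(components, block_state, i, t):
    block_state.noise_pred = components.transformer(block_state.model_input, t)
    return components, block_state


def loop_after_denoiser(components, block_state, i, t):
    block_state.latents = block_state.latents - block_state.noise_pred  # toy scheduler step
    block_state.steps_run = i + 1
    return components, block_state


components = SimpleNamespace(transformer=lambda model_input, t: t)
state = SimpleNamespace(latents=10, timesteps=[3, 2, 1])
loop = ToyLoopWrapper([loop_before_denoiser, loop_denoiser, loop_after_denoiser])
_, state = loop(components, state)
```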

Key pattern: Workflow selection

python

class AutoDenoise(ConditionalPipelineBlocks):
    block_classes = [V2VDenoiseStep, I2VDenoiseStep, T2VDenoiseStep]
    block_trigger_inputs = ["video_latents", "image_latents"]
    default_block_name = "text2video"
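
With three blocks but only two trigger inputs, the mapping is not 1:1, so the choice needs explicit selection logic. The sketch below shows one plausible shape of that logic as plain functions. The block names, the precedence of video_latents over image_latents, and the "return None to fall back to the default" convention are assumptions for illustration, not a statement of the diffusers API.

```python
from typing import Optional

BLOCK_NAMES = ["video2video", "image2video", "text2video"]  # hypothetical names
DEFAULT_BLOCK_NAME = "text2video"


def select_block(video_latents=None, image_latents=None) -> Optional[str]:
    """Pick a workflow from the trigger inputs; None means "use the default"."""
    if video_latents is not None:
        return "video2video"
    if image_latents is not None:
        return "image2video"
    return None


def resolve_workflow(**inputs) -> str:
    chosen = select_block(
        video_latents=inputs.get("video_latents"),
        image_latents=inputs.get("image_latents"),
    )
    return chosen if chosen is not None else DEFAULT_BLOCK_NAME
```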

Key pattern: Standalone block reusability

One of the core reasons a pipeline is split into blocks at all: each block (text encoder, VAE encoder, prepare-latents, denoise, decoder) must be runnable on its own, and its output must be reusable as the input to a different downstream chain.

Concretely:

- The text encoder block returns prompt_embeds. A user can run only that block, save the embeddings, and feed them to the denoise loop later, possibly with a different num_images_per_prompt, possibly across multiple runs.
- The VAE encoder is its own block in encoders.py (e.g. WanVaeEncoderStep) returning image_latents. The prepare-latents block accepts image_latents, not raw images, so users can swap in pre-encoded latents.
- The decoder block accepts denoised latents from any source: directly from the denoise loop, or after an injected step (upscale, latent edit). Don't bundle decoding into the denoise loop.

Two consequences for input plumbing:

- Encoder / VAE-encoder blocks accept raw inputs only (prompt, image, ...) and emit per-prompt outputs (prompt_embeds, image_latents). They do not bake in num_images_per_prompt.
- Per-prompt expansion happens in a dedicated input step inside the core denoise sequence (e.g. <Model>TextInputStep). That keeps pre-encoded embeds reusable across runs with different num_images_per_prompt. See qwenimage/before_denoise.py for the canonical input step.
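
The split between "encode once" and "expand later" can be sketched with plain lists. `toy_text_encoder` and `toy_text_input_step` are hypothetical stand-ins; real blocks work on tensors and go through block state.

```python
def toy_text_encoder(prompts):
    # Hypothetical encoder: one small "embedding" per prompt, with no batch expansion.
    return [[float(len(p)), float(p.count(" ") + 1)] for p in prompts]


def toy_text_input_step(prompt_embeds, num_images_per_prompt):
    # Per-prompt expansion lives here, inside the denoise sequence, not in the encoder.
    return [list(e) for e in prompt_embeds for _ in range(num_images_per_prompt)]


# Encode once, then reuse the same embeddings for runs with different batch sizes.
embeds = toy_text_encoder(["a cat", "a red car"])
batch_x1 = toy_text_input_step(embeds, 1)
batch_x3 = toy_text_input_step(embeds, 3)
```

Because the encoder output stays per-prompt, the saved embeds do not need to be recomputed when num_images_per_prompt changes.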

Standard pipelines accept prompt_embeds / image_latents as __call__ inputs so users can skip encoding. In modular pipelines this is unnecessary — users just pop out the encoder block and run it standalone. Don't accept pre-computed encoder outputs as __call__ inputs of an encoder block.

Key pattern: Flat block assembly

Prefer flat sequences over nested compositions. Put the Auto / Conditional selection at the top level and make each workflow variant a flat InsertableDict of leaf blocks. Try not to nest AutoPipelineBlocks inside SequentialPipelineBlocks inside AutoPipelineBlocks — debugging which workflow was selected, and which block inside which sub-block touched which state, becomes painful. See flux2/modular_blocks_flux2_klein.py for the canonical shape.
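
As a rough illustration of the flat shape, the sketch below uses ordinary dicts in place of InsertableDict and tracing functions in place of real blocks. All block and workflow names here are hypothetical.

```python
def leaf(name):
    """Build a hypothetical leaf block that only records that it ran."""
    def run(state):
        state["trace"].append(name)
        return state
    return run


# Each workflow is one flat, ordered mapping of leaf blocks (dicts keep insertion order).
TEXT2IMAGE = {
    n: leaf(n)
    for n in ["text_encoder", "text_input", "prepare_latents", "denoise", "decode"]
}
IMAGE2IMAGE = {
    n: leaf(n)
    for n in ["text_encoder", "vae_encoder", "text_input", "prepare_latents", "denoise", "decode"]
}
WORKFLOWS = {"text2image": TEXT2IMAGE, "image2image": IMAGE2IMAGE}


def run_pipeline(**inputs):
    # The only selection point lives at the top level.
    name = "image2image" if inputs.get("image") is not None else "text2image"
    state = {"trace": [], **inputs}
    for block in WORKFLOWS[name].values():
        state = block(state)
    return name, state["trace"]
```

With this layout, the selected workflow and the exact block order can be read off from a single mapping, which is the debugging benefit described above.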

InputParam / OutputParam

Use .template("<name>") for params with a canonical meaning (prompt, negative_prompt, image, generator, num_inference_steps, latents, prompt_embeds, images, videos, etc.) — the template carries a vetted description and type hint. The full registry lives in src/diffusers/modular_pipelines/modular_pipeline_utils.py (INPUT_PARAM_TEMPLATES, OUTPUT_PARAM_TEMPLATES); read that file rather than relying on a hardcoded list here, since names get added.

For params that don't match a template (model-specific names, custom semantics), declare the field directly:

python

# Inputs
InputParam(
    "text_lens",
    required=True,
    type_hint=torch.Tensor,
    description="Per-prompt text lengths used by the transformer attention mask.",
)

# Outputs
OutputParam(
    "text_bth",
    type_hint=torch.Tensor,
    kwargs_type="denoiser_input_fields",
    description="Padded text hidden states of shape (B, T_max, H) fed into the transformer.",
)

If a template's predefined description doesn't fit (e.g. the "latents" output template means "Denoised latents", which is wrong for the noisy latents out of a prepare-latents step), drop the template and declare the field directly with an accurate description. See gotcha #5.
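
The template-versus-direct choice can be sketched with a toy registry. `ToyOutputParam` and `TOY_OUTPUT_TEMPLATES` are hypothetical and much simpler than the real classes in modular_pipeline_utils.py; only the "Denoised latents" description is taken from the text above.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical registry with a single entry.
TOY_OUTPUT_TEMPLATES = {
    "latents": {"type_hint": list, "description": "Denoised latents"},
}


@dataclass(frozen=True)
class ToyOutputParam:
    name: str
    type_hint: Any = None
    description: str = ""
    kwargs_type: Optional[str] = None

    @classmethod
    def template(cls, name: str) -> "ToyOutputParam":
        # Reuse the vetted type hint and description registered under this name.
        return cls(name=name, **TOY_OUTPUT_TEMPLATES[name])


# Denoise-loop output: the template's meaning matches, so use it.
denoised = ToyOutputParam.template("latents")

# Prepare-latents output: same field name, different meaning, so declare it directly.
noisy = ToyOutputParam(
    "latents",
    type_hint=list,
    description="Initial noisy latents produced by the prepare-latents step.",
)
```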

ComponentSpec patterns

python

# models (with weights) - loaded from pretrained
ComponentSpec("transformer", YourTransformerModel)
ComponentSpec("vae", AutoencoderKL)

# weightless objects - created inline from config
ComponentSpec(
    "guider",
    ClassifierFreeGuidance,
    config=FrozenDict({"guidance_scale": 7.5}),
    default_creation_method="from_config"
)
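
The difference between the two kinds of spec can be sketched as follows. `ToySpec`, `ToyTransformer`, `ToyGuider`, and `build_components` are hypothetical stand-ins. The assumption illustrated is that a "from_config" component can be constructed from its config alone, while a weighted model stays unset until the user loads it.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpec:
    """Hypothetical, simplified stand-in for a component spec."""
    name: str
    type_hint: type
    config: dict = field(default_factory=dict)
    default_creation_method: str = "from_pretrained"


class ToyTransformer:
    pass


class ToyGuider:
    def __init__(self, guidance_scale: float = 1.0):
        self.guidance_scale = guidance_scale


def build_components(specs, loaded=None):
    loaded = loaded or {}
    components = {}
    for spec in specs:
        if spec.default_creation_method == "from_config":
            # Weightless object: the config alone is enough to construct it.
            components[spec.name] = spec.type_hint(**spec.config)
        else:
            # Model with weights: stays None until a checkpoint is loaded.
            components[spec.name] = loaded.get(spec.name)
    return components


specs = [
    ToySpec("transformer", ToyTransformer),
    ToySpec(
        "guider",
        ToyGuider,
        config={"guidance_scale": 7.5},
        default_creation_method="from_config",
    ),
]
components = build_components(specs)
```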

Gotchas

1. Importing from standard pipelines. The modular and standard pipeline systems are parallel: modular blocks must not import from diffusers.pipelines.*. For shared utility methods (e.g. _pack_latents, retrieve_timesteps), either redefine them as standalone functions or use # Copied from diffusers.pipelines.<model>... headers. See wan/before_denoise.py and helios/before_denoise.py for examples.
2. Cross-importing between modular pipelines. Don't import utilities from another model's modular pipeline (e.g. SD3 importing from qwenimage.inputs). If a utility is shared, move it to modular_pipeline_utils.py or copy it with a # Copied from header.
3. Accepting guidance_scale as a pipeline input. Users configure the guider separately (see guider docs). Different guider types have different parameters; forwarding them through the pipeline doesn't scale. Don't manually set components.guider.guidance_scale = ... inside blocks. The same applies to computing do_classifier_free_guidance: that logic belongs in the guider. Exception: some pipelines that only support distilled checkpoints (e.g. distilled Flux) skip CFG entirely and don't carry a guider. In that case guidance_scale is a real model input, not a guider knob, and accepting it as a pipeline input is fine. If you're reviewing a pipeline that doesn't have a guider in expected_components, flag it explicitly so the choice is intentional.
4. Instantiating components inline. If a class like VideoProcessor is needed, register it as a ComponentSpec and access it via components.video_processor. Don't create new instances inside block __call__.
5. Using InputParam.template() / OutputParam.template() when semantics don't match. Templates carry predefined descriptions; for example, the "latents" output template means "Denoised latents". Don't use it for initial noisy latents from a prepare-latents step. Use a plain InputParam(...) / OutputParam(...) with an accurate description instead.
6. Test model paths pointing to contributor repos. Tiny test models must live under hf-internal-testing/, not personal repos like username/tiny-model. Move the model before merge.
7. Respect the declared IO system. Once components are declared in expected_components and fields in inputs / intermediate_outputs, the modular framework guarantees them. So:
   - Don't read defensively. Declared components are always set as attributes (possibly None); declared upstream outputs are always populated in block_state after the upstream block runs. getattr(components, "vae", None), hasattr(self, "vae"), getattr(block_state, "prompt_embeds", None) are dead code that hides typos. Use components.vae / block_state.prompt_embeds directly. Check is not None only when nullability is meaningful (a component the user might not have loaded).
   - Don't write undeclared. If a block sets block_state.foo = ..., declare OutputParam("foo", ...) in intermediate_outputs. The declarations are the public contract; undeclared writes can't be wired to downstream blocks.
   - Don't call state.set() directly inside a block. Write to state only through declared intermediate_outputs via self.get_block_state(state) / self.set_block_state(state, block_state). A direct state.set("foo", value) bypasses the block's interface entirely: the field never appears as a declared output, so downstream blocks can't see it through the normal wiring and the framework can't generate docs / validate types for it.
8. No-op skip logic inside an optional block. If a step is conditional (e.g. an optional prompt enhancer), don't have the block check a flag at the top of __call__ and return early. Wrap it in an AutoPipelineBlocks with block_trigger_inputs = ["use_xxx"] so the block is only assembled into the pipeline when the trigger input is provided. The block's own __call__ should always assume its components and inputs are present.
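
The declared-IO rule can be made concrete with a toy runner. Everything below is a hypothetical sketch: `run_block` imitates the get_block_state / set_block_state round trip with plain dicts, and the explicit check for undeclared writes is an assumption added to make the contract visible, not a claim about what the framework enforces.

```python
from types import SimpleNamespace


class PrepareLatents:
    inputs = ["prompt_embeds"]
    intermediate_outputs = ["latents"]

    def __call__(self, block_state):
        # Read declared inputs directly; no getattr(..., None) fallback.
        block_state.latents = [2 * e for e in block_state.prompt_embeds]
        return block_state


class SloppyBlock(PrepareLatents):
    def __call__(self, block_state):
        block_state = super().__call__(block_state)
        block_state.debug_scale = 2  # written but never declared
        return block_state


def run_block(block, state):
    # Analogue of get_block_state: expose only the declared inputs.
    block_state = SimpleNamespace(**{name: state[name] for name in block.inputs})
    block_state = block(block_state)
    undeclared = set(vars(block_state)) - set(block.inputs) - set(block.intermediate_outputs)
    if undeclared:
        raise ValueError(f"undeclared outputs: {sorted(undeclared)}")
    # Analogue of set_block_state: publish only the declared outputs.
    state.update({name: getattr(block_state, name) for name in block.intermediate_outputs})
    return state
```

In this sketch a misspelled input name fails immediately with a KeyError, whereas a defensive getattr(..., None) would have turned the same typo into a silent None.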
