Complete Pitfalls Reference

1. Global CPU RNG

MultivariateNormal.sample() draws from the global CPU RNG and ignores any torch.Generator, so passing a generator= kwarg won't help. Call torch.manual_seed(seed) before each pipeline run.

2. Timestep dtype

Many transformers expect int64 timesteps. get_timestep_embedding casts to float, so 745.3 and 745 produce different embeddings. Match the reference's casting.
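The effect is easy to reproduce with a simplified sinusoidal embedding. This is a minimal sketch, not the actual get_timestep_embedding implementation: the function name sinusoidal_embedding, the dimension, and max_period are assumptions chosen for illustration. The int cast is one possible reference behavior; use whatever cast the reference actually applies.

```python
import math

def sinusoidal_embedding(timestep, dim=8, max_period=10000.0):
    """Toy sinusoidal timestep embedding (illustrative stand-in, not the library function)."""
    half = dim // 2
    t = float(timestep)  # the embedding is computed in float, so fractional parts survive
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally sized lists."""
    return max(abs(x - y) for x, y in zip(a, b))

ref = sinusoidal_embedding(745)
drift_float = max_abs_diff(ref, sinusoidal_embedding(745.3))       # fractional timestep kept
drift_cast = max_abs_diff(ref, sinusoidal_embedding(int(745.3)))   # cast to int first

print(f"float timestep drift:    {drift_float:.3e}")
print(f"int-cast timestep drift: {drift_cast:.3e}")
```

The int-cast drift is zero because both calls see the same value; the float drift is not.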

3. Guidance parameter mapping

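Parameter names and semantics may differ between implementations. For example, the reference's zero_steps=1 (applied while i <= 1, i.e. 2 steps) corresponds to the target's zero_init_steps=2 (applied while step < 2), not to zero_init_steps=1. Check the exact comparison each side uses.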
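A quick way to check a mapping like this is to enumerate which steps each condition selects. The sketch below assumes the comparisons described above (i <= zero_steps on the reference side, step < zero_init_steps on the target side); the function names are hypothetical.

```python
def reference_zeroed(num_steps, zero_steps):
    """Reference-style check (assumed): guidance is zeroed while i <= zero_steps."""
    return [i for i in range(num_steps) if i <= zero_steps]

def target_zeroed(num_steps, zero_init_steps):
    """Target-style check (assumed): guidance is zeroed while step < zero_init_steps."""
    return [step for step in range(num_steps) if step < zero_init_steps]

print(reference_zeroed(5, zero_steps=1))     # [0, 1]
print(target_zeroed(5, zero_init_steps=2))   # [0, 1]  -> matches the reference
print(target_zeroed(5, zero_init_steps=1))   # [0]     -> copying the value verbatim is off by one
```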

4. `patch_size` in noise generation

If noise generation depends on patch_size (e.g. sample_block_noise), it must be passed through. Missing it changes noise spatial structure.

5. Variable shadowing in nested loops

Nested loops (stages -> chunks -> timesteps) can shadow variable names. If the outer loop uses latents and the inner loop also assigns to latents, the inner assignment overwrites the outer value, so scoping must match the reference.

6. Float precision differences -- don't dismiss them

Target may compute in float32 where reference used bfloat16. Small per-element diffs (1e-3 to 1e-2) look harmless but can compound catastrophically over iterative processes like denoising loops (see Pitfalls #11 and #13). Before dismissing a precision difference: (a) check whether it feeds into an iterative process, (b) if so, trace the accumulation curve over all iterations to see if it stays bounded or grows exponentially. Only truly non-iterative precision diffs (e.g. in a single-pass encoder) are safe to accept.
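Tracing the accumulation curve can be as simple as computing the max difference at every saved step and the growth ratio between consecutive steps. The helper below is a minimal sketch: the name trace_divergence and the capture format (one flat list of floats per step) are assumptions, and the data is synthetic.

```python
def trace_divergence(ref_steps, tgt_steps):
    """Return (step, max_abs_diff, growth_ratio) for each saved step.

    ref_steps / tgt_steps: sequences of equally sized flat lists of floats,
    one entry per denoising step (hypothetical capture format).
    """
    rows = []
    prev = None
    for step, (ref, tgt) in enumerate(zip(ref_steps, tgt_steps)):
        diff = max(abs(r - t) for r, t in zip(ref, tgt))
        # Ratio is undefined for the first step or when the previous diff was zero.
        ratio = diff / prev if prev else float("nan")
        rows.append((step, diff, ratio))
        prev = diff
    return rows

# Synthetic example: the target's error doubles at every step.
ref_steps = [[0.0, 1.0] for _ in range(4)]
tgt_steps = [[0.0, 1.0 + 0.01 * 2 ** k] for k in range(4)]

rows = trace_divergence(ref_steps, tgt_steps)
for step, diff, ratio in rows:
    print(f"step {step}: max_diff={diff:.3e} growth={ratio:.2f}x")
```

A ratio that stays near or below 1 suggests the difference is bounded; a ratio that stays well above 1 across steps suggests it is compounding. With real tensors, the per-step diff would typically be computed as (ref - tgt).abs().max().item().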

7. Scheduler state reset between stages

Some schedulers accumulate state (e.g. model_outputs in UniPC) that must be cleared between stages.
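The toy class below illustrates why stale history matters. It is not UniPC and not a diffusers API: ToyMultistepScheduler, its averaging rule, and its reset() method are all hypothetical, chosen only to show how outputs left over from stage 1 can leak into the first step of stage 2.

```python
class ToyMultistepScheduler:
    """Toy stand-in for a multistep scheduler that averages recent model outputs."""

    def __init__(self, order=2):
        self.order = order
        self.model_outputs = []

    def reset(self):
        # Drop the history so the next stage starts from a clean state.
        self.model_outputs = []

    def step(self, model_output, sample, step_size=0.1):
        # Keep only the last `order` outputs, then step along their average.
        self.model_outputs = (self.model_outputs + [model_output])[-self.order:]
        direction = sum(self.model_outputs) / len(self.model_outputs)
        return sample - step_size * direction

def run_stage(scheduler, outputs, sample):
    """Run one stage: apply every model output in order and return the final sample."""
    for out in outputs:
        sample = scheduler.step(out, sample)
    return sample

scheduler = ToyMultistepScheduler()
run_stage(scheduler, [1.0, 1.0], sample=0.0)       # stage 1 leaves history behind
stale = run_stage(scheduler, [-1.0], sample=0.0)   # stage 2 mixes in a stage-1 output

scheduler.reset()
clean = run_stage(scheduler, [-1.0], sample=0.0)   # stage 2 with history cleared

print(f"without reset: {stale}, with reset: {clean}")
```

How a real scheduler clears its state depends on the library; check what the reference does between stages and mirror it.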

8. Component access

Standard pipelines access components as self.transformer; modular pipelines use components.transformer. Mixing the two up raises AttributeError.

9. Guider state across stages

In multi-stage denoising, the guider's internal state (e.g. zero_init_steps) may need save/restore between stages.

10. Model storage location

NEVER store converted models in /tmp/ -- temporary directories get wiped on restart. Always save converted checkpoints under a persistent path in the project repo (e.g. models/ltx23-diffusers/).

11. Noise dtype mismatch (causes washed-out output)

Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:

python

noise = torch.randn(..., dtype=torch.float32, generator=gen)
noise = noise.to(dtype=model_dtype)  # bfloat16 -- values get quantized

Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (Pitfall #13) and produces completely washed-out output.

Fix: Match the reference -- generate noise in the model's working dtype:

python

latent_dtype = self.transformer.dtype  # e.g. bfloat16
latents = self.prepare_latents(..., dtype=latent_dtype, ...)

Detection: Encode stage test shows an initial latent max_diff of about 1.5e-02. This specific magnitude is the signature of float32->bfloat16 quantization error.
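Where that magnitude comes from: bfloat16 keeps 7 explicit mantissa bits, so for values with 4 <= |x| < 8 the representable grid has spacing 2^-5 and the worst-case rounding error is 2^-6, roughly 1.56e-02. Standard normal noise with many elements will contain a few such values. The sketch below emulates the rounding in pure Python (round-to-nearest-even is assumed; to_bfloat16 and to_float32 are illustrative helpers, not library functions).

```python
import random
import struct

def to_bfloat16(x):
    """Round a value to the nearest bfloat16 (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias; ties go to the even mantissa
    bits &= 0xFFFF0000                    # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Simulate float32 noise, cast it to bfloat16, and measure the round-trip error.
rng = random.Random(0)
noise = [to_float32(rng.gauss(0.0, 1.0)) for _ in range(200_000)]
max_diff = max(abs(v - to_bfloat16(v)) for v in noise)

print(f"max_diff after bfloat16 round-trip: {max_diff:.4e}")
print(f"worst case for 4 <= |x| < 8:        {2 ** -6:.4e}")
```

The measured max_diff depends on the largest samples drawn, but it cannot exceed 2^-6 unless a sample reaches |x| >= 8.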

12. RoPE position dtype

RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.

13. 1/sigma error amplification in Euler denoising

In Euler/flow-matching, the velocity formula divides by sigma: v = (latents - pred_x0) / sigma. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (Pitfalls #11, #12) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
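The amplification factor follows directly from the formula: a difference d in (latents - pred_x0) becomes d / sigma in the velocity. The arithmetic sketch below uses the 1.5e-02 figure from above; the sigma values are picked for illustration and do not represent a real schedule. It covers only the single-step velocity error, not the full accumulated trajectory error, which also depends on step sizes and on how the model responds to perturbed inputs.

```python
def velocity_error(latent_diff, sigma):
    """Error in v = (latents - pred_x0) / sigma caused by a difference in the numerator."""
    return latent_diff / sigma

for sigma in (1.0, 0.5, 0.1, 0.01, 0.001):
    print(f"sigma={sigma:<6} amplification={1.0 / sigma:>7.1f}x  "
          f"velocity error={velocity_error(1.5e-2, sigma):.3e}")
```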

14. Config value assumptions -- always diff, never assume

When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.

The pattern that goes wrong:

1. You see param_x has default 1 in the code.
2. The reference code also uses param_x with a default of 1.
3. You assume both sides use 1 and apply a "fix" based on that.
4. But the actual checkpoint config has param_x: 1000, and so does the published diffusers config.
5. Your "fix" now creates divergence instead of fixing it.

Prevention -- config diff first:

python

# Reference: read the config from checkpoint metadata (no model loading needed)
import json
from safetensors import safe_open

with safe_open(checkpoint_path, framework="pt") as f:
    ref_config = json.loads(f.metadata()["config"]).get("transformer", {})

# Diffusers: read from the model config
from diffusers import MyModel  # placeholder: substitute the actual model class
diff_model = MyModel.from_pretrained(model_path, subfolder="transformer")
diff_config = dict(diff_model.config)

# Compare all values; report keys that differ or exist on only one side
for key in sorted(set(ref_config) | set(diff_config)):
    ref_val = ref_config.get(key, "MISSING")
    diff_val = diff_config.get(key, "MISSING")
    if ref_val != diff_val:
        print(f"  DIFF {key}: ref={ref_val}, diff={diff_val}")

Run this before writing any hooks, analysis code, or fixes. It takes 30 seconds and catches wrong assumptions immediately.

When debugging divergence -- trace values, don't reason about them: If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.

15. Decoder config mismatch (causes pixelated artifacts)

The upstream model config may have wrong values for decoder-specific parameters (e.g. upsample_residual, upsample_type). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.

Detection: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.
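For reference, PSNR is 10 * log10(MAX^2 / MSE). The helper below is a minimal sketch that assumes pixel values are normalized to [0, 1] (max_value=1.0) and passed as flat sequences; adjust max_value if the decoders emit a different range.

```python
import math

def psnr(ref_pixels, tgt_pixels, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(ref_pixels, tgt_pixels)) / len(ref_pixels)
    if mse == 0:
        return float("inf")  # identical outputs
    return 10.0 * math.log10(max_value ** 2 / mse)

# Synthetic example: a uniform offset of 0.01 gives MSE = 1e-4, i.e. about 40 dB.
ref_pixels = [0.0, 0.5, 1.0, 0.25]
tgt_pixels = [p + 0.01 for p in ref_pixels]
print(f"PSNR: {psnr(ref_pixels, tgt_pixels):.2f} dB")
```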

Fix: Correct the config value. Don't edit cached files in ~/.cache/huggingface/ -- either save to a local model directory or open a PR on the upstream repo (see Testing Rule #7).

16. Incomplete injection tests -- inject ALL variables or the test is invalid

When doing injection tests (feeding reference tensors into the diffusers pipeline), you must inject every divergent input, including sigmas/timesteps. A common mistake: the preloop checkpoint saves sigmas but the injection code only loads latents and embeddings. The test then runs with different sigma schedules, making it impossible to isolate the real cause.

Prevention: After writing injection code, verify by listing every variable the injected stage consumes and checking each one is either (a) injected from reference, or (b) confirmed identical between pipelines.
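This check is easy to automate. The sketch below is one way to do it, with hypothetical variable names standing in for whatever the injected stage actually consumes.

```python
def uncovered_inputs(consumed, injected, confirmed_identical):
    """Return stage inputs that are neither injected nor confirmed identical."""
    return sorted(set(consumed) - set(injected) - set(confirmed_identical))

consumed = ["latents", "prompt_embeds", "sigmas", "timesteps"]   # hypothetical names
injected = ["latents", "prompt_embeds"]                          # loaded from the reference checkpoint
confirmed = ["timesteps"]                                        # verified equal in both pipelines

missing = uncovered_inputs(consumed, injected, confirmed)
print(missing)  # ['sigmas']
```

A non-empty result means the injection test is not yet valid.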

17. bf16 connector/encoder divergence -- don't chase it

When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.

18. Stale test fixtures

When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.
