.ai/skills/parity-testing/pitfalls.md
MultivariateNormal.sample() uses the global CPU RNG, not torch.Generator. Must call torch.manual_seed(seed) before each pipeline run. A generator= kwarg won't help.
Many transformers expect int64 timesteps. get_timestep_embedding casts to float, so 745.3 and 745 produce different embeddings. Match the reference's casting.
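To see why the cast matters, here is a minimal sketch of a sinusoidal timestep embedding. It is a simplified stand-in, not the actual get_timestep_embedding implementation; the function name, `dim`, and `max_period` are assumptions chosen for illustration.

```python
import math

def sinusoidal_embedding(timestep, dim=8, max_period=10000.0):
    """Simplified sinusoidal embedding (illustrative, not the library's code)."""
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1.0 for i == 0.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [float(timestep) * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

emb_float = sinusoidal_embedding(745.3)        # timestep passed through as float
emb_int = sinusoidal_embedding(int(745.3))     # timestep truncated to 745 first
max_diff = max(abs(a - b) for a, b in zip(emb_float, emb_int))
print(f"max embedding diff between 745.3 and 745: {max_diff:.3f}")
```

The highest-frequency channel sees the full 0.3 offset as a phase shift, so the two embeddings are nowhere near identical. Whichever cast the reference applies before the embedding is the one to reproduce.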
Parameter names may differ: reference zero_steps=1 (meaning i <= 1, 2 steps) vs target zero_init_steps=2 (meaning step < 2, same thing). Check exact semantics.
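A quick way to check the semantics is to enumerate which step indices each condition selects. The sketch below uses two hypothetical helper functions that mirror the comparisons quoted above.

```python
def zeroed_steps_reference(num_steps, zero_steps):
    """Reference-style condition: step index i is zeroed when i <= zero_steps."""
    return [i for i in range(num_steps) if i <= zero_steps]

def zeroed_steps_target(num_steps, zero_init_steps):
    """Target-style condition: step is zeroed when step < zero_init_steps."""
    return [step for step in range(num_steps) if step < zero_init_steps]

# Equivalent settings select the same steps.
print(zeroed_steps_reference(30, zero_steps=1))      # [0, 1]
print(zeroed_steps_target(30, zero_init_steps=2))    # [0, 1]

# Copying the number across without translating it drops a step.
print(zeroed_steps_target(30, zero_init_steps=1))    # [0]
```

Copying the value 1 straight from the reference into the target would silently zero one step fewer.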
patch_size in noise generation: If noise generation depends on patch_size (e.g. sample_block_noise), it must be passed through. Missing it changes noise spatial structure.
Nested loops (stages -> chunks -> timesteps) can shadow variable names. If outer loop uses latents and inner loop also assigns to latents, scoping must match the reference.
Target may compute in float32 where reference used bfloat16. Small per-element diffs (1e-3 to 1e-2) look harmless but can compound catastrophically over iterative processes like denoising loops (see Pitfalls #11 and #13). Before dismissing a precision difference: (a) check whether it feeds into an iterative process, (b) if so, trace the accumulation curve over all iterations to see if it stays bounded or grows exponentially. Only truly non-iterative precision diffs (e.g. in a single-pass encoder) are safe to accept.
Some schedulers accumulate state (e.g. model_outputs in UniPC) that must be cleared between stages.
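The effect of stale state can be illustrated with a toy multistep scheduler. This is a hypothetical class that simply averages recent model outputs; it is not UniPC and does not reflect any real scheduler's API, but it shows how history from one stage leaks into the first step of the next.

```python
class ToyMultistepScheduler:
    """Hypothetical scheduler that averages the last `order` model outputs."""

    def __init__(self, order=2):
        self.order = order
        self.model_outputs = []

    def step(self, model_output):
        # Keep a sliding window of recent outputs, as multistep solvers do.
        self.model_outputs.append(model_output)
        self.model_outputs = self.model_outputs[-self.order:]
        return sum(self.model_outputs) / len(self.model_outputs)

    def reset(self):
        """Clear accumulated history (call between stages)."""
        self.model_outputs = []

# Stage 1 leaves history behind.
stale = ToyMultistepScheduler()
stale.step(1.0)
stale.step(1.0)
contaminated = stale.step(3.0)   # first step of stage 2 mixes in stage 1 history

# Same sequence, but history is cleared before stage 2.
clean = ToyMultistepScheduler()
clean.step(1.0)
clean.step(1.0)
clean.reset()
expected = clean.step(3.0)

print(contaminated, expected)    # 2.0 3.0
```

For a real scheduler, check how the reference resets state between stages and mirror that exactly.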
Component access differs between pipeline styles. Standard pipelines use self.transformer; modular pipelines use components.transformer. Using the wrong accessor raises AttributeError.
In multi-stage denoising, the guider's internal state (e.g. zero_init_steps) may need save/restore between stages.
NEVER store converted models in /tmp/ -- temporary directories get wiped on restart. Always save converted checkpoints under a persistent path in the project repo (e.g. models/ltx23-diffusers/).
Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:
```python
noise = torch.randn(..., dtype=torch.float32, generator=gen)
noise = noise.to(dtype=model_dtype)  # bfloat16 -- values get quantized
```
Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (Pitfall #13) and produces completely washed-out output.
Fix: Match the reference -- generate noise in the model's working dtype:
```python
latent_dtype = self.transformer.dtype  # e.g. bfloat16
latents = self.prepare_latents(..., dtype=latent_dtype, ...)
```
Detection: The encode stage test shows an initial latent max_diff of about 1.5e-02. This magnitude is the signature of float32->bfloat16 quantization error.
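The ~1.5e-02 figure can be reproduced without a tensor library. The sketch below emulates a float32 -> bfloat16 cast with bit manipulation and measures the worst-case error for values in [4, 8), the range where the largest samples of standard normal noise tend to fall in a large tensor. Round-to-nearest-even is an assumption about the cast, and `to_f32` / `to_bf16` are hypothetical helper names.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x):
    """Emulate a float32 -> bfloat16 cast (round-to-nearest-even assumed)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Keep the top 16 bits; the bias makes ties round to the even mantissa.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# bfloat16 spacing in [4, 8) is 2**-5, so the worst rounding error is 2**-6.
samples = [to_f32(4.0 + i / 1024) for i in range(4096)]
max_err = max(abs(s - to_bf16(s)) for s in samples)
print(f"max float32->bfloat16 error in [4, 8): {max_err:.6f}")  # 0.015625
```

A max_diff near 0.0156 on freshly generated noise therefore points at a dtype cast rather than a logic bug.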
RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.
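The sensitivity is easy to demonstrate with one position and one frequency. The sketch below assumes raw integer-valued positions at unit frequency; real RoPE uses many frequencies and the reference may scale positions, so treat the numbers as illustrative. `to_bf16` is a hypothetical helper that emulates the cast with round-to-nearest-even.

```python
import math
import struct

def to_bf16(x):
    """Emulate a float32 -> bfloat16 cast (round-to-nearest-even assumed)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# bfloat16 has 8 significant bits, so integers above 256 are not all representable.
print(to_bf16(255.0), to_bf16(257.0))    # 255.0 256.0

position = 1001.0
bf16_position = to_bf16(position)        # snaps to 1000.0
diff = abs(math.cos(position) - math.cos(bf16_position))
print(f"position {position} -> {bf16_position}, cosine diff = {diff:.3f}")
```

A one-unit shift in position is a full radian of phase at unit frequency, which is how cosine/sine differences approach the 2.0 worst case.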
In Euler/flow-matching, the velocity formula divides by sigma: v = (latents - pred_x0) / sigma. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (Pitfalls #11, #12) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
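The 1/sigma amplification can be tabulated directly. The sketch below assumes a linearly spaced sigma schedule from 1.0 to 0.001 over 30 steps and a fixed latent difference of 1.5e-02 with identical pred_x0. Both are simplifying assumptions, and the helper names are hypothetical.

```python
def linear_sigmas(start, end, num_steps):
    """Linearly spaced sigma schedule (an assumption; real schedules vary)."""
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]

def velocity_error_bound(latent_diff, sigmas):
    """Velocity error per step if latents differ by latent_diff and pred_x0 matches."""
    return [latent_diff / s for s in sigmas]

sigmas = linear_sigmas(1.0, 0.001, 30)
bounds = velocity_error_bound(1.5e-2, sigmas)

for i in (0, 15, 29):
    print(f"step {i:2d}: sigma={sigmas[i]:.4f}  "
          f"amplification={1 / sigmas[i]:7.1f}x  velocity error={bounds[i]:.3f}")
```

Amplification stays modest for most of the schedule and then spikes in the last few steps, which matches the late blow-up described above.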
When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.
The pattern that goes wrong:
1. You see that param_x has default 1 in the code.
2. You assume the checkpoint uses param_x with a default of 1 and apply a "fix" based on that.
3. The actual checkpoint config has param_x: 1000, and so does the published diffusers config.

Prevention -- config diff first:
```python
import json

from safetensors import safe_open
from diffusers import MyModel  # placeholder: substitute the actual model class

# Reference: read from checkpoint metadata (no model loading needed)
with safe_open(checkpoint_path, framework="pt") as f:
    ref_config = json.loads(f.metadata()["config"])
ref_transformer_config = ref_config.get("transformer", {})

# Diffusers: read from model config
diff_model = MyModel.from_pretrained(model_path, subfolder="transformer")
diff_config = dict(diff_model.config)

# Compare all values, including keys present on only one side
for key in sorted(set(ref_transformer_config) | set(diff_config)):
    ref_val = ref_transformer_config.get(key, "MISSING")
    diff_val = diff_config.get(key, "MISSING")
    if ref_val != diff_val:
        print(f"  DIFF {key}: ref={ref_val}, diff={diff_val}")
```
Run this before writing any hooks, analysis code, or fixes. It takes 30 seconds and catches wrong assumptions immediately.
When debugging divergence -- trace values, don't reason about them: If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.
The upstream model config may have wrong values for decoder-specific parameters (e.g. upsample_residual, upsample_type). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.
Detection: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.
Fix: Correct the config value. Don't edit cached files in ~/.cache/huggingface/ -- either save to a local model directory or open a PR on the upstream repo (see Testing Rule #7).
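The PSNR check from the detection step can be sketched in plain Python. This assumes pixel values are already flattened into sequences and scaled to [0, 1]; the `psnr` function name and `max_value` parameter are illustrative.

```python
import math

def psnr(reference, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if len(reference) != len(target):
        raise ValueError("inputs must have the same number of elements")
    mse = sum((a - b) ** 2 for a, b in zip(reference, target)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10 * math.log10(max_value ** 2 / mse)

ref_pixels = [0.0, 0.5, 1.0, 0.25]
close_pixels = [0.001, 0.499, 0.999, 0.251]   # small precision-level noise
off_pixels = [0.1, 0.4, 0.9, 0.35]            # structural difference

print(f"close: {psnr(ref_pixels, close_pixels):.1f} dB")
print(f"off:   {psnr(ref_pixels, off_pixels):.1f} dB")
```

With the 40 dB threshold above, the first comparison would pass and the second would count as a real decoder bug.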
When doing injection tests (feeding reference tensors into the diffusers pipeline), you must inject every divergent input, including sigmas/timesteps. A common mistake: the preloop checkpoint saves sigmas but the injection code only loads latents and embeddings. The test then runs with different sigma schedules, making it impossible to isolate the real cause.
Prevention: After writing injection code, verify by listing every variable the injected stage consumes and checking each one is either (a) injected from reference, or (b) confirmed identical between pipelines.
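The verification step above amounts to a set difference. A minimal sketch, with a hypothetical helper and illustrative variable names:

```python
def uncovered_inputs(consumed, injected, confirmed_identical):
    """Return inputs the stage consumes that are neither injected nor confirmed identical."""
    return sorted(set(consumed) - set(injected) - set(confirmed_identical))

# Variable names are illustrative; list whatever the injected stage actually reads.
consumed = ["latents", "prompt_embeds", "sigmas", "timesteps"]
injected = ["latents", "prompt_embeds"]
confirmed_identical = ["timesteps"]

missing = uncovered_inputs(consumed, injected, confirmed_identical)
print(missing)  # ['sigmas']
```

An empty result means every input is accounted for. Anything left over is a candidate for phantom divergence.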
When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.
When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.
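One way to enforce this is to store a fingerprint of the run configuration alongside each fixture and refuse to compare fixtures whose fingerprints differ. The sketch below is an assumption about how that could look; the helper names and metadata fields are hypothetical.

```python
import hashlib
import json

def run_fingerprint(seed, config, code_version):
    """Short, deterministic hash of the settings that define a run."""
    payload = json.dumps(
        {"seed": seed, "config": config, "code_version": code_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def assert_same_run(ref_fingerprint, target_fingerprint):
    """Raise if two fixtures were captured under different run settings."""
    if ref_fingerprint != target_fingerprint:
        raise ValueError(
            f"fixture mismatch: ref={ref_fingerprint}, target={target_fingerprint}; "
            "regenerate both sides in one run"
        )

config = {"num_steps": 30, "dtype": "bfloat16"}
ref_fp = run_fingerprint(seed=42, config=config, code_version="abc123")
tgt_fp = run_fingerprint(seed=42, config=config, code_version="abc123")
assert_same_run(ref_fp, tgt_fp)  # passes: same seed, config, and code version
```

Saving the fingerprint next to each tensor file makes stale fixtures fail loudly instead of producing a misleading max_diff.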