candle-z-image: Text-to-Image Generation with Flow Matching

candle-examples/examples/z_image/README.md

Z-Image is a ~24B parameter text-to-image generation model developed by Alibaba that uses flow matching for high-quality image synthesis. The model is available on ModelScope and HuggingFace.

Model Architecture

  • Transformer: 24B parameter DiT with 30 main layers, 2 noise refiner layers, and 2 context refiner layers
  • Text Encoder: Qwen3-based encoder (outputs second-to-last hidden states)
  • VAE: AutoEncoderKL with diffusers format weights
  • Scheduler: FlowMatchEulerDiscreteScheduler with dynamic shifting

Running the Model

Basic Usage (Auto-download from HuggingFace)

```bash
cargo run --features cuda --example z_image --release -- \
    --model turbo \
    --prompt "A beautiful landscape with mountains and a lake" \
    --width 1024 --height 768 \
    --num-steps 8
```

Using Metal (macOS)

```bash
cargo run --features metal --example z_image --release -- \
    --model turbo \
    --prompt "A futuristic city at night with neon lights" \
    --width 1024 --height 1024 \
    --num-steps 9
```

Using Local Weights

If you prefer to use locally downloaded weights:

```bash
# Download weights first
hf download Tongyi-MAI/Z-Image-Turbo --local-dir weights/Z-Image-Turbo

# Run with local path
cargo run --features cuda --example z_image --release -- \
    --model turbo \
    --model-path weights/Z-Image-Turbo \
    --prompt "A beautiful landscape with mountains and a lake"
```

Command-line Flags

| Flag | Description | Default |
|------|-------------|---------|
| `--model` | Model variant to use (`turbo`) | `turbo` |
| `--model-path` | Override path to local weights (optional) | Auto-download |
| `--prompt` | The text prompt for image generation | Required |
| `--negative-prompt` | Negative prompt for CFG guidance | `""` |
| `--width` | Width of the generated image (must be divisible by 16) | `1024` |
| `--height` | Height of the generated image (must be divisible by 16) | `1024` |
| `--num-steps` | Number of denoising steps | Model default (9 for turbo) |
| `--guidance-scale` | Classifier-free guidance scale | `5.0` |
| `--seed` | Random seed for reproducibility | Random |
| `--output` | Output image filename | `z_image_output.png` |
| `--cpu` | Use CPU instead of GPU | `false` |

Image Size Requirements

Image dimensions must be divisible by 16. Valid sizes include:

  • ✅ 1024×1024, 1024×768, 768×1024, 512×512, 1280×720, 1920×1088
  • ❌ 1920×1080 (1080 is not divisible by 16)

If an invalid size is provided, the program will suggest valid alternatives.
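The divisibility check and the suggestion step can be sketched as follows. This is a minimal illustration, not the example's actual code: the helper names `nearest_valid` and `validate_size` are hypothetical, and offering the multiples of 16 just below and just above an invalid dimension is an assumed way of producing suggestions.

```rust
/// Image dimensions must be a multiple of this value (from the README).
const SIZE_MULTIPLE: usize = 16;

/// Hypothetical helper: the multiples of 16 just below and just above `dim`.
fn nearest_valid(dim: usize) -> (usize, usize) {
    let lower = (dim / SIZE_MULTIPLE) * SIZE_MULTIPLE;
    let upper = lower + SIZE_MULTIPLE;
    // Never suggest 0 as a dimension.
    (lower.max(SIZE_MULTIPLE), upper)
}

/// Hypothetical helper: Ok if both dimensions are valid, otherwise an error
/// message listing nearby valid values for each offending dimension.
fn validate_size(width: usize, height: usize) -> Result<(), String> {
    let mut problems = Vec::new();
    for (name, dim) in [("width", width), ("height", height)] {
        if dim == 0 || dim % SIZE_MULTIPLE != 0 {
            let (lower, upper) = nearest_valid(dim);
            problems.push(format!(
                "{name}={dim} is not a positive multiple of 16; try {lower} or {upper}"
            ));
        }
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems.join("; "))
    }
}

fn main() {
    println!("{:?}", validate_size(1280, 720)); // valid size
    println!("{:?}", validate_size(1920, 1080)); // height is rejected
}
```

For 1920×1080 this sketch suggests a height of 1072 or 1088, which matches the 1920×1088 entry in the list of valid sizes.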

Performance Notes

  • Turbo Version: Z-Image-Turbo is optimized for fast inference, requiring only 8-9 steps
  • Memory Usage: The 24B model requires significant GPU memory. If you encounter out-of-memory (OOM) errors, reduce the image dimensions

Example Outputs

```bash
# Landscape (16:9)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A serene mountain lake at sunset, photorealistic, 4k" \
    --width 1280 --height 720 --num-steps 8

# Portrait (3:4)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A portrait of a wise elderly scholar, oil painting style" \
    --width 768 --height 1024 --num-steps 9

# Square (1:1)
cargo run --features metal --example z_image -r -- \
    --model turbo \
    --prompt "A cute robot holding a candle, digital art" \
    --width 1024 --height 1024 --num-steps 8
```

Technical Details

Latent Space

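The VAE uses an 8× spatial scale factor, so each latent dimension is one eighth of the corresponding image dimension. Because image dimensions must be divisible by 16, the latent dimensions are always even. They are calculated as: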

latent_height = 2 × (image_height ÷ 16)
latent_width = 2 × (image_width ÷ 16)
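As a quick worked example, the formula can be written as a small function. This is a sketch for illustration; the function name `latent_dims` is hypothetical and not taken from the example's source.

```rust
/// Latent (height, width) for a given image size, following
/// latent = 2 × (image ÷ 16). Returns None when a dimension is not a
/// multiple of 16, since the formula assumes exact division.
fn latent_dims(image_height: usize, image_width: usize) -> Option<(usize, usize)> {
    if image_height % 16 != 0 || image_width % 16 != 0 {
        return None;
    }
    Some((2 * (image_height / 16), 2 * (image_width / 16)))
}

fn main() {
    // (height, width) pairs taken from the valid sizes listed in this README.
    for (h, w) in [(1024, 1024), (768, 1024), (720, 1280)] {
        println!("{h}x{w} image -> latent {:?}", latent_dims(h, w));
    }
}
```

A 1024×1024 image therefore maps to a 128×128 latent, and a 1280×720 image (width × height) maps to a latent of height 90 and width 160.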

3D RoPE Position Encoding

Z-Image uses 3D Rotary Position Embeddings with axes:

  • Frame (temporal): 32 dims, max 1536 positions
  • Height (spatial): 48 dims, max 512 positions
  • Width (spatial): 48 dims, max 512 positions
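To make the axis split concrete, the sketch below computes per-axis rotation angles in the standard RoPE manner, where an axis with d dims contributes d/2 rotation frequencies. The three axis sizes come from the list above. The base value `THETA` and the function names are assumptions for illustration only; they are not read from the model config, and the way the model assigns positions to tokens is not covered here.

```rust
/// (name, rotary dims, max positions) for each axis, as listed above.
const AXES: [(&str, usize, usize); 3] = [
    ("frame", 32, 1536),
    ("height", 48, 512),
    ("width", 48, 512),
];

/// Assumed RoPE base; a placeholder, not taken from the model config.
const THETA: f64 = 10_000.0;

/// Standard RoPE inverse frequencies for one axis:
/// theta^(-2i/dim) for i in 0..dim/2.
fn axis_inv_freqs(dim: usize) -> Vec<f64> {
    (0..dim / 2)
        .map(|i| THETA.powf(-((2 * i) as f64) / dim as f64))
        .collect()
}

/// Rotation angles for a token at position [frame, height, width].
/// Each axis rotates its own slice of the head dimension.
fn rotation_angles(pos: [usize; 3]) -> Vec<f64> {
    let mut angles = Vec::new();
    for ((_, dim, max_pos), p) in AXES.iter().zip(pos) {
        assert!(p < *max_pos, "position out of range for axis");
        angles.extend(axis_inv_freqs(*dim).into_iter().map(|f| p as f64 * f));
    }
    angles
}

fn main() {
    let total: usize = AXES.iter().map(|(_, dim, _)| dim).sum();
    println!("rotary dims in total: {total}"); // 32 + 48 + 48 = 128
    println!("angles per token: {}", rotation_angles([0, 3, 5]).len()); // 16 + 24 + 24 = 64
}
```

The three axes together cover 128 rotary dims, so each token gets 64 rotation angles, one per pair of dims.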

Dynamic Timestep Shifting

The scheduler uses dynamic shifting based on image sequence length:

mu = BASE_SHIFT + (image_seq_len - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN) × (MAX_SHIFT - BASE_SHIFT)

Where BASE_SHIFT=0.5, MAX_SHIFT=1.15, BASE_SEQ_LEN=256, MAX_SEQ_LEN=4096.
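The formula is a linear interpolation between the two shift values, which a short sketch makes easy to check numerically. The constants are the ones given above; the function name `calculate_mu` is chosen for illustration, and how `image_seq_len` is derived from the image size is left as an input because this README does not spell it out.

```rust
const BASE_SHIFT: f64 = 0.5;
const MAX_SHIFT: f64 = 1.15;
const BASE_SEQ_LEN: f64 = 256.0;
const MAX_SEQ_LEN: f64 = 4096.0;

/// Shift parameter mu, linearly interpolated over the image sequence length.
fn calculate_mu(image_seq_len: usize) -> f64 {
    // Fraction of the way from BASE_SEQ_LEN to MAX_SEQ_LEN.
    let t = (image_seq_len as f64 - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN);
    BASE_SHIFT + t * (MAX_SHIFT - BASE_SHIFT)
}

fn main() {
    for seq_len in [256, 1024, 4096] {
        println!("image_seq_len = {seq_len:>4} -> mu = {:.4}", calculate_mu(seq_len));
    }
}
```

With these constants, mu is 0.5 at 256 tokens, 0.63 at 1024 tokens, and 1.15 at 4096 tokens. The formula as written is not clamped, so sequence lengths outside the 256 to 4096 range extrapolate beyond the two shift values.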