candle-examples/examples/z_image/README.md
Z-Image is a ~24B parameter text-to-image generation model developed by Alibaba that uses flow matching for high-quality image synthesis. Weights are available on ModelScope and Hugging Face.
# CUDA (NVIDIA GPU)
cargo run --features cuda --example z_image --release -- \
--model turbo \
--prompt "A beautiful landscape with mountains and a lake" \
--width 1024 --height 768 \
--num-steps 8
# Metal (macOS)
cargo run --features metal --example z_image --release -- \
--model turbo \
--prompt "A futuristic city at night with neon lights" \
--width 1024 --height 1024 \
--num-steps 9
If you prefer to use locally downloaded weights:
# Download weights first
hf download Tongyi-MAI/Z-Image-Turbo --local-dir weights/Z-Image-Turbo
# Run with local path
cargo run --features cuda --example z_image --release -- \
--model turbo \
--model-path weights/Z-Image-Turbo \
--prompt "A beautiful landscape with mountains and a lake"
| Flag | Description | Default |
|---|---|---|
| `--model` | Model variant to use (`turbo`) | `turbo` |
| `--model-path` | Override path to local weights (optional) | Auto-download |
| `--prompt` | The text prompt for image generation | Required |
| `--negative-prompt` | Negative prompt for CFG guidance | `""` |
| `--width` | Width of the generated image (must be divisible by 16) | `1024` |
| `--height` | Height of the generated image (must be divisible by 16) | `1024` |
| `--num-steps` | Number of denoising steps | Model default (9 for turbo) |
| `--guidance-scale` | Classifier-free guidance scale | `5.0` |
| `--seed` | Random seed for reproducibility | Random |
| `--output` | Output image filename | `z_image_output.png` |
| `--cpu` | Use CPU instead of GPU | `false` |
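Image dimensions must be divisible by 16. Valid sizes include 1024×1024, 1024×768, 1280×720, and 768×1024 (width × height), all of which appear in the example commands in this README.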
If an invalid size is provided, the program will suggest valid alternatives.
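The divisibility check and the suggestion of alternatives could look roughly like the following. This is a minimal sketch, not the example's actual code: the names `nearest_valid` and `validate_dimensions` are hypothetical, and rounding each dimension to the nearest multiple of 16 is an assumed suggestion strategy.

```rust
/// Required multiple for image width and height (from the rule above).
const ALIGN: usize = 16;

/// Hypothetical helper: round `value` to the nearest positive multiple of 16.
/// Ties round up, and the result is never smaller than 16.
fn nearest_valid(value: usize) -> usize {
    let rounded = (value + ALIGN / 2) / ALIGN * ALIGN;
    rounded.max(ALIGN)
}

/// Hypothetical helper: accept the size, or return an error message that
/// suggests the nearest valid size (an assumed strategy, see the lead-in).
fn validate_dimensions(width: usize, height: usize) -> Result<(), String> {
    let ok = width > 0 && height > 0 && width % ALIGN == 0 && height % ALIGN == 0;
    if ok {
        return Ok(());
    }
    Err(format!(
        "invalid size {}x{}: both dimensions must be divisible by {}; try {}x{}",
        width,
        height,
        ALIGN,
        nearest_valid(width),
        nearest_valid(height)
    ))
}
```

With this sketch, `validate_dimensions(1280, 720)` succeeds, while `validate_dimensions(1000, 720)` returns an error suggesting 1008x720.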
# Landscape (16:9)
cargo run --features metal --example z_image -r -- \
--model turbo \
--prompt "A serene mountain lake at sunset, photorealistic, 4k" \
--width 1280 --height 720 --num-steps 8
# Portrait (3:4)
cargo run --features metal --example z_image -r -- \
--model turbo \
--prompt "A portrait of a wise elderly scholar, oil painting style" \
--width 768 --height 1024 --num-steps 9
# Square (1:1)
cargo run --features metal --example z_image -r -- \
--model turbo \
--prompt "A cute robot holding a candle, digital art" \
--width 1024 --height 1024 --num-steps 8
The VAE upsamples by 8× when decoding, so each latent dimension is one eighth of the corresponding image dimension. Latent dimensions are calculated as:
latent_height = 2 × (image_height ÷ 16)
latent_width = 2 × (image_width ÷ 16)
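As an illustration, the formula above can be written as a small function. This is a sketch only: the name `latent_dims` is hypothetical, and returning `None` for sizes that are zero or not divisible by 16 is an assumption about how invalid input might be handled.

```rust
/// Latent (height, width) for a given image size, following the formula
/// above: latent = 2 * (image / 16).
/// Returns `None` if either dimension is zero or not divisible by 16
/// (an assumed error-handling choice for this sketch).
fn latent_dims(image_height: usize, image_width: usize) -> Option<(usize, usize)> {
    let valid = |d: usize| d > 0 && d % 16 == 0;
    if !valid(image_height) || !valid(image_width) {
        return None;
    }
    Some((2 * (image_height / 16), 2 * (image_width / 16)))
}
```

For `--width 1024 --height 768`, `latent_dims(768, 1024)` yields a latent of height 96 and width 128, which is one eighth of each image dimension.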
Z-Image uses 3D Rotary Position Embeddings with axes:
The scheduler uses dynamic shifting based on image sequence length:
mu = BASE_SHIFT + (image_seq_len - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN) × (MAX_SHIFT - BASE_SHIFT)
Where BASE_SHIFT=0.5, MAX_SHIFT=1.15, BASE_SEQ_LEN=256, MAX_SEQ_LEN=4096.
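The shift formula is a linear interpolation between the two shift values, and can be sketched as below. The constants come from the text above; the function name `calculate_shift` is hypothetical, and the sketch does not clamp `mu` for sequence lengths outside the 256 to 4096 range, since the formula as stated does not either.

```rust
const BASE_SHIFT: f64 = 0.5;
const MAX_SHIFT: f64 = 1.15;
const BASE_SEQ_LEN: f64 = 256.0;
const MAX_SEQ_LEN: f64 = 4096.0;

/// Hypothetical helper: compute the shift `mu` from the image sequence
/// length by linear interpolation between BASE_SHIFT and MAX_SHIFT.
/// No clamping is applied, so values outside the range extrapolate.
fn calculate_shift(image_seq_len: usize) -> f64 {
    // Fraction of the way from BASE_SEQ_LEN to MAX_SEQ_LEN.
    let t = (image_seq_len as f64 - BASE_SEQ_LEN) / (MAX_SEQ_LEN - BASE_SEQ_LEN);
    BASE_SHIFT + t * (MAX_SHIFT - BASE_SHIFT)
}
```

At the endpoints this gives `mu = 0.5` for a sequence length of 256 and `mu = 1.15` for a sequence length of 4096, so longer image sequences receive a larger shift.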