# Latent Perceptual Loss (LPL) for Stable Diffusion XL
This directory contains an implementation of Latent Perceptual Loss (LPL) for training Stable Diffusion XL models, based on the paper [Boosting Latent Diffusion with Perceptual Objectives](https://openreview.net/forum?id=y4DtzADzd1) (Berrada et al., 2025). LPL is a perceptual loss that operates in the latent space of a VAE; it improves the quality and consistency of generated images by bridging the disconnect between the diffusion model and the autoencoder decoder. The implementation is based on the reference implementation provided by Tariq Berrada.
## Overview

LPL addresses a key limitation of latent diffusion models (LDMs): the disconnect between diffusion model training and the autoencoder decoder. Because LDMs are trained entirely in the latent space, they receive no direct feedback about how well their outputs decode into high-quality images, which can degrade fine details and overall fidelity in the decoded results.
LPL works by comparing intermediate features from the VAE decoder between the predicted and target latents. This gives the model a perceptual training signal in decoder-feature space, encouraging sharper details and better overall image quality. A minimal sketch of the idea is shown below.
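To make this concrete, here is a minimal sketch of the core computation. The `decoder_features` callable is a hypothetical stand-in for the actual feature extraction (one possible implementation is sketched under Technical Details); the real script additionally folds in normalization, outlier removal, and weighting, all described below:

```python
import torch
import torch.nn.functional as F


def latent_perceptual_loss(decoder_features, pred_latents, target_latents):
    """Compare intermediate VAE decoder features of predicted vs. target latents.

    `decoder_features` is any callable mapping latents to a list of
    intermediate decoder feature maps.
    """
    pred_feats = decoder_features(pred_latents)
    with torch.no_grad():
        # The target latents provide the reference features; no gradients needed.
        target_feats = decoder_features(target_latents)
    # Mean squared distance per decoder layer, averaged over layers
    # (L1 is the other loss type supported by the script).
    loss = sum(F.mse_loss(p, t) for p, t in zip(pred_feats, target_feats))
    return loss / len(pred_feats)
```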
## Features

The LPL implementation follows the paper's methodology and includes several key features:

- **Feature extraction**: extracts intermediate features from multiple stages of the VAE decoder (up to a configurable number of up blocks).
- **Feature normalization**: multiple normalization options, as validated in the paper (the `shared` variant is sketched after this list):
  - `default`: normalize each feature map independently
  - `shared`: cross-normalize features using target statistics (recommended)
  - `batch`: batch-wise normalization
- **Outlier handling**: optional removal of outliers in the feature maps.
- **Loss types**: MSE (default) or L1 distance between feature maps.
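As an illustration of the `shared` option, the sketch below cross-normalizes by computing statistics from the target features only and applying them to both tensors. This is illustrative, not the script's exact code:

```python
import torch


def shared_normalize(pred_feat: torch.Tensor, target_feat: torch.Tensor, eps: float = 1e-6):
    # Per-channel statistics computed from the *target* features only
    # (NCHW layout assumed), then applied to both feature maps so the
    # prediction is normalized against the target's scale.
    dims = (0, 2, 3)
    mu = target_feat.mean(dim=dims, keepdim=True)
    sigma = target_feat.std(dim=dims, keepdim=True)
    return (pred_feat - mu) / (sigma + eps), (target_feat - mu) / (sigma + eps)
```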
## Usage

To use LPL in your training, add the following arguments to your training command:
```bash
python examples/research_projects/lpl/train_sdxl_lpl.py \
  --use_lpl \
  --lpl_weight 1.0 \
  --lpl_t_threshold 200 \
  --lpl_loss_type mse \
  --lpl_norm_type shared \
  --lpl_pow_law \
  --lpl_num_blocks 4 \
  --lpl_remove_outliers \
  --lpl_scale \
  --lpl_start 0
  # ... other training arguments ...
```
### Parameters

- `lpl_weight`: controls the strength of the LPL loss relative to the main diffusion loss. Higher values (1.0-2.0) improve quality but may slow training.
- `lpl_t_threshold`: LPL is only applied for timesteps below this threshold (high SNR). Lower values (100-200) focus on the more important timesteps.
- `lpl_loss_type`: choose between MSE (default) and L1 loss. MSE is recommended for most cases.
- `lpl_norm_type`: feature normalization strategy. `shared` is recommended, as it showed the best results in the paper.
- `lpl_pow_law`: whether to use power-law weighting (2^(-i) for layer i; see the sketch after the recommended settings below). Recommended for better feature balance.
- `lpl_num_blocks`: number of up blocks to use for feature extraction (1-4). More blocks capture more features but use more memory.
- `lpl_remove_outliers`: whether to remove outliers in the feature maps. Recommended for stable training.
- `lpl_scale`: whether to scale the LPL loss by noise-level weights. Helps focus on the more important timesteps.
- `lpl_start`: training step at which to start applying LPL. Can be used to warm up training.

### Recommended Settings

Starting point (based on the paper's results):
```bash
--use_lpl \
--lpl_weight 1.0 \
--lpl_t_threshold 200 \
--lpl_loss_type mse \
--lpl_norm_type shared \
--lpl_pow_law \
--lpl_num_blocks 4 \
--lpl_remove_outliers \
--lpl_scale
```
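To make the timestep gating and layer weighting concrete, here is a hedged sketch of how `--lpl_t_threshold` and `--lpl_pow_law` could interact; the function name and reduction details are illustrative, not the script's exact code:

```python
import torch
import torch.nn.functional as F


def gated_weighted_lpl(pred_feats, target_feats, timesteps, t_threshold=200, pow_law=True):
    # LPL is applied only to samples at timesteps below the threshold
    # (high SNR), where the decoded features are most informative.
    mask = (timesteps < t_threshold).float()
    if mask.sum() == 0:
        return pred_feats[0].new_zeros(())

    loss = pred_feats[0].new_zeros(())
    for i, (p, t) in enumerate(zip(pred_feats, target_feats)):
        layer_weight = 2.0 ** (-i) if pow_law else 1.0  # power-law weight per layer
        # Per-sample feature distance, averaged over channels and spatial dims.
        per_sample = F.mse_loss(p, t, reduction="none").mean(dim=(1, 2, 3))
        loss = loss + layer_weight * (per_sample * mask).sum() / mask.sum()
    return loss
```

With `--lpl_scale`, the script additionally weights this loss by noise-level weights before adding `lpl_weight` times it to the main diffusion loss.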
## Performance Tips

Memory efficiency:

- Use `--gradient_checkpointing` for memory efficiency (enabled by default).
- Reduce `lpl_num_blocks` if memory is constrained (2-3 blocks still give good results).
- Use `--lpl_scale` to focus on the more important timesteps.

Quality vs. speed:

- Increase `lpl_weight` (1.0-2.0) for better quality.
- Lower `lpl_t_threshold` (100-200) for faster training.
- Enable `lpl_remove_outliers` for more stable training.
- `lpl_norm_type shared` provides the best quality/speed trade-off.

## Technical Details

The LPL implementation extracts intermediate features from the VAE decoder, up to the configured number of up blocks (`lpl_num_blocks`).
Each feature map is optionally cleaned of outliers and normalized according to `lpl_norm_type`. For each feature map, the MSE or L1 distance between the predicted and target features is computed, weighted (by the power law across layers and by the noise-level weights, when enabled), and the weighted distances are summed into the final LPL loss. One possible way to capture the decoder features is sketched below.
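A sketch of feature capture using forward hooks, assuming a diffusers `AutoencoderKL` whose decoder exposes `mid_block` and `up_blocks`; the hook placement is illustrative and may differ from the training script:

```python
import torch
from diffusers import AutoencoderKL


def get_decoder_features(vae: AutoencoderKL, latents: torch.Tensor, num_blocks: int = 4):
    """Decode latents and capture intermediate decoder feature maps via hooks."""
    features, hooks = [], []

    def capture(_module, _inputs, output):
        features.append(output)

    # Middle block first, then the first `num_blocks` up blocks.
    hooks.append(vae.decoder.mid_block.register_forward_hook(capture))
    for block in vae.decoder.up_blocks[:num_blocks]:
        hooks.append(block.register_forward_hook(capture))

    try:
        # SDXL latents are scaled; undo the scaling before decoding.
        vae.decode(latents / vae.config.scaling_factor)
    finally:
        for hook in hooks:
            hook.remove()
    return features
```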
## Results

Based on the paper's findings, LPL provides consistent improvements in perceptual image quality and in quantitative metrics such as FID compared to standard latent diffusion training.
## Citation

If you use this implementation in your research, please cite:
```bibtex
@inproceedings{berrada2025boosting,
  title={Boosting Latent Diffusion with Perceptual Objectives},
  author={Tariq Berrada and Pietro Astolfi and Melissa Hall and Marton Havasi and Yohann Benchetrit and Adriana Romero-Soriano and Karteek Alahari and Michal Drozdzal and Jakob Verbeek},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=y4DtzADzd1}
}
```