<!--Copyright 2025 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

# Token merging

Token merging (ToMe) progressively merges redundant tokens/patches in the forward pass of a Transformer-based network, which can reduce the inference latency of [`StableDiffusionPipeline`].

Install ToMe from pip:

```bash
pip install tomesd
```

You can use ToMe from the `tomesd` library with the `apply_patch` function:

```diff
  from diffusers import StableDiffusionPipeline
  import torch
  import tomesd

  pipeline = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
  ).to("cuda")
+ tomesd.apply_patch(pipeline, ratio=0.5)

  image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```

The `apply_patch` function exposes a number of arguments to help strike a balance between pipeline inference speed and the quality of the generated images. The most important argument is `ratio`, which controls the number of tokens that are merged during the forward pass.
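Beyond `ratio`, `apply_patch` exposes a few more knobs. The sketch below continues with the `pipeline` from the example above; the extra argument names (`max_downsample`, `sx`, `sy`, `merge_attn`) and the `remove_patch` helper follow the `tomesd` README, so verify them against your installed version:

```py
import tomesd

# Argument names besides `ratio` follow the tomesd README; check your installed version.
tomesd.apply_patch(
    pipeline,
    ratio=0.5,         # fraction of tokens to merge (higher = faster, lower quality)
    max_downsample=1,  # only patch UNet layers at this downsample factor or less
    sx=2, sy=2,        # stride of the 2D partition used to pick merge destinations
    merge_attn=True,   # merge tokens for self-attention (the main source of speed-up)
)

# Restore the original, unpatched pipeline
tomesd.remove_patch(pipeline)
```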

As reported in the paper, ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed up inference even further, but at the cost of some degraded image quality.
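For example, you can sweep a few `ratio` values and time the pipeline to see the trade-off for yourself. This is a minimal, illustrative sketch (the prompt, seed, and timing loop are our own, not from the benchmark script below):

```py
import time

import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

for ratio in (0.25, 0.5, 0.75):
    tomesd.apply_patch(pipeline, ratio=ratio)

    torch.cuda.synchronize()
    start = time.perf_counter()
    # Fixed seed so quality differences come from ToMe, not the noise
    image = pipeline(prompt, generator=torch.manual_seed(0)).images[0]
    torch.cuda.synchronize()
    print(f"ratio={ratio}: {time.perf_counter() - start:.2f}s")

    image.save(f"astronaut_ratio_{ratio}.png")  # compare quality side by side
    tomesd.remove_patch(pipeline)               # reset before the next ratio
```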

To test the quality of the generated images, we sampled a few prompts from Parti Prompts and performed inference with the [`StableDiffusionPipeline`].


We didn't notice any significant decrease in the quality of the generated samples; you can check them out in this WandB report. If you're interested in reproducing this experiment, use this script.

## Benchmarks

We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with xFormers enabled across several image resolutions. The results were obtained from A100 and V100 GPUs in the following development environment:

```bash
- `diffusers` version: 0.15.1
- Python version: 3.8.16
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Huggingface_hub version: 0.13.2
- Transformers version: 4.27.2
- Accelerate version: 0.18.0
- xFormers version: 0.0.16
- tomesd version: 0.1.2
```

To reproduce this benchmark, feel free to use this script. The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers.
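A single timed run in the spirit of this benchmark could look like the sketch below (illustrative only; the numbers in the table come from the linked script). It assumes a CUDA GPU with xFormers installed:

```py
import time

import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")
pipeline.enable_xformers_memory_efficient_attention()  # the "ToMe + xFormers" column
tomesd.apply_patch(pipeline, ratio=0.5)

prompt = "a photo of an astronaut riding a horse on mars"
batch_size, resolution = 2, 768

def timed_run():
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipeline(prompt, num_images_per_prompt=batch_size, height=resolution, width=resolution)
    torch.cuda.synchronize()
    return time.perf_counter() - start

timed_run()                       # warmup
print(f"{timed_run():.2f} seconds")
```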

| GPU | Resolution | Batch size | Vanilla | ToMe | ToMe + xFormers |
|---|---|---|---|---|---|
| A100 | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) |
| | 768 | 10 | OOM | 14.71 | 11 |
| | | 8 | OOM | 11.56 | 8.84 |
| | | 4 | OOM | 5.98 | 4.66 |
| | | 2 | 4.99 | 3.24 (+35.07%) | 2.1 (+37.88%) |
| | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) |
| | 1024 | 10 | OOM | OOM | OOM |
| | | 8 | OOM | OOM | OOM |
| | | 4 | OOM | 12.51 | 9.09 |
| | | 2 | OOM | 6.52 | 4.96 |
| | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) |
| V100 | 512 | 10 | OOM | 10.03 | 9.29 |
| | | 8 | OOM | 8.05 | 7.47 |
| | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) |
| | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) |
| | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) |
| | 768 | 10 | OOM | OOM | 23.67 |
| | | 8 | OOM | OOM | 18.81 |
| | | 4 | OOM | 11.81 | 9.7 |
| | | 2 | OOM | 6.27 | 5.2 |
| | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) |
| | 1024 | 10 | OOM | OOM | OOM |
| | | 8 | OOM | OOM | OOM |
| | | 4 | OOM | OOM | 19.35 |
| | | 2 | OOM | 13 | 10.78 |
| | | 1 | OOM | 6.66 | 5.54 |

As seen in the table above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline at a higher resolution like 1024x1024, where the vanilla pipeline runs out of memory. You may be able to speed up inference even more with `torch.compile`, as in the sketch below.
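A minimal sketch of combining the two, assuming PyTorch 2.0+ (the benchmark environment above used 1.13.1); whether the patched UNet compiles cleanly may depend on your `tomesd` and PyTorch versions:

```py
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")

# Patch first, then compile, so the merged forward pass is what gets traced.
tomesd.apply_patch(pipeline, ratio=0.5)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead")

# The first call is slow (compilation); subsequent calls benefit.
image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```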