# T-GATE

T-GATE accelerates inference for Stable Diffusion, PixArt, and Latent Consistency Model pipelines by skipping the cross-attention calculation once it converges. This method doesn't require any additional training and it can speed up inference by 10-50%. T-GATE is also compatible with other optimization methods like DeepCache.

Before you begin, make sure you install T-GATE.

```bash
pip install tgate
pip install -U torch diffusers transformers accelerate DeepCache
```

To use T-GATE with a pipeline, you need to use its corresponding loader.

| Pipeline | T-GATE Loader |
|---|---|
| PixArt | TgatePixArtLoader |
| Stable Diffusion XL | TgateSDXLLoader |
| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader |
| Stable Diffusion | TgateSDLoader |
| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader |

Next, create the loader with a pipeline, the gate step (the time step at which to stop calculating the cross-attention), and the number of inference steps. Then call the tgate method on the pipeline with a prompt, gate step, and the number of inference steps.

Let's see how to enable this for several different pipelines.

<hfoptions id="pipelines"> <hfoption id="PixArt">

Accelerate PixArtAlphaPipeline with T-GATE:

```py
import torch
from diffusers import PixArtAlphaPipeline
from tgate import TgatePixArtLoader

pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)

gate_step = 8
inference_step = 25
pipe = TgatePixArtLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
).to("cuda")

image = pipe.tgate(
    "An alpaca made of colorful building blocks, cyberpunk.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
```
</hfoption> <hfoption id="Stable Diffusion XL">

Accelerate StableDiffusionXLPipeline with T-GATE:

```py
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers import DPMSolverMultistepScheduler
from tgate import TgateSDXLLoader

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

gate_step = 10
inference_step = 25
pipe = TgateSDXLLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
).to("cuda")

image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
```
</hfoption> <hfoption id="StableDiffusionXL with DeepCache">

Accelerate StableDiffusionXLPipeline with DeepCache and T-GATE:

```py
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers import DPMSolverMultistepScheduler
from tgate import TgateSDXLDeepCacheLoader

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

gate_step = 10
inference_step = 25
pipe = TgateSDXLDeepCacheLoader(
    pipe,
    cache_interval=3,
    cache_branch_id=0,
).to("cuda")

image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
```
</hfoption> <hfoption id="Latent Consistency Model">

Accelerate latent-consistency/lcm-sdxl with T-GATE:

```py
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers import UNet2DConditionModel, LCMScheduler
from tgate import TgateSDXLLoader

unet = UNet2DConditionModel.from_pretrained(
    "latent-consistency/lcm-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

gate_step = 1
inference_step = 4
pipe = TgateSDXLLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
    lcm=True,
).to("cuda")

image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
```
</hfoption> </hfoptions>

T-GATE also supports [StableDiffusionPipeline] and PixArt-alpha/PixArt-LCM-XL-2-1024-MS.

## Benchmarks

| Model | MACs | Param | Latency | Zero-shot 10K-FID on MS-COCO |
|---|---|---|---|---|
| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 |
| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 |
| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 |
| SD-2.1 w/ T-GATE | 22.208T | 815.433M | 9.878s | 19.940 |
| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 |
| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 |
| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 |
| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 |
| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 |
| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 |
| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 |
| LCM w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 |
| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 |
| LCM w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 |

The latency is measured on an NVIDIA 1080TI, the MACs and Params are calculated with calflops, and the FID is calculated with PytorchFID.