<!--Copyright 2025 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

Remote inference

> [!TIP]
> This is currently an experimental feature. If you have any feedback, please feel free to leave it here.

Remote inference offloads the encoding and decoding process to a remote endpoint to relax the memory requirements of local inference with large models. This feature is powered by Inference Endpoints. Refer to the table below for the supported models and endpoints.

| Model | Endpoint | Checkpoint | Support |
|---|---|---|---|
| Stable Diffusion v1 | https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud | stabilityai/sd-vae-ft-mse | encode/decode |
| Stable Diffusion XL | https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud | madebyollin/sdxl-vae-fp16-fix | encode/decode |
| Flux | https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud | black-forest-labs/FLUX.1-schnell | encode/decode |
| HunyuanVideo | https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud | hunyuanvideo-community/HunyuanVideo | decode |

This guide will show you how to encode and decode latents with remote inference.

Encoding

Encoding converts images and videos into latent representations. Refer to the table above for the supported VAEs.

Pass an image to [`~utils.remote_encode`] to encode it. The specific `scaling_factor` and `shift_factor` values for each model can be found in the Remote inference API reference.

```py
import torch
from diffusers import FluxPipeline
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_encode

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
init_image = init_image.resize((768, 512))

init_latent = remote_encode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud",
    image=init_image,
    scaling_factor=0.3611,
    shift_factor=0.1159
)
```
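
The `scaling_factor` and `shift_factor` follow the usual Diffusers VAE convention: encoded latents are shifted and scaled so that downstream pipelines see roughly unit-scale inputs, and decoding inverts the transform. A minimal sketch of that convention (the helper names are our own, not a library API):

```python
import torch

# Flux VAE values, matching the remote_encode call above
scaling_factor, shift_factor = 0.3611, 0.1159

def normalize(raw_latent: torch.Tensor) -> torch.Tensor:
    # Diffusers-style normalization applied after VAE encoding
    return (raw_latent - shift_factor) * scaling_factor

def denormalize(latent: torch.Tensor) -> torch.Tensor:
    # the inverse, applied before VAE decoding
    return latent / scaling_factor + shift_factor

x = torch.randn(1, 16, 64, 64)
# the two transforms round-trip exactly (up to float precision)
assert torch.allclose(denormalize(normalize(x)), x, atol=1e-5)
```

This is why the same `scaling_factor`/`shift_factor` pair must be passed to both `remote_encode` and `remote_decode` for a given model.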

Decoding

Decoding converts latent representations back into images or videos. Refer to the table above for the supported VAEs.

Set `output_type="latent"` in the pipeline and set the `vae` to `None`. Pass the latents to the [`~utils.remote_decode`] function. For Flux, the latents are packed, so the `height` and `width` also need to be passed. The specific `scaling_factor` and `shift_factor` values for each model can be found in the Remote inference API reference.

<hfoptions id="decode"> <hfoption id="Flux">
```py
import torch
from diffusers import FluxPipeline
from diffusers.utils.remote_utils import remote_decode

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,
    device_map="cuda"
)

prompt = """
A photorealistic Apollo-era photograph of a cat in a small astronaut suit with a bubble helmet, standing on the Moon and holding a flagpole planted in the dusty lunar soil. The flag shows a colorful paw-print emblem. Earth glows in the black sky above the stark gray surface, with sharp shadows and high-contrast lighting like vintage NASA photos.
"""

latent = pipeline(
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=4,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,
    width=1024,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
image.save("image.jpg")
```
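
`height` and `width` are required here because Flux latents arrive packed as a `(batch, num_patches, 64)` sequence of 2x2 patches, so the spatial shape cannot be inferred from the tensor alone. A rough sketch of the unpacking (it mirrors the idea behind the pipeline's internal unpacking, not Diffusers' exact implementation):

```python
import torch

def unpack_flux_latents(latents: torch.Tensor, height: int, width: int, vae_scale_factor: int = 8) -> torch.Tensor:
    """Rearrange packed (batch, num_patches, 64) Flux latents into (batch, 16, H/8, W/8)."""
    batch, num_patches, channels = latents.shape
    h = height // (vae_scale_factor * 2)  # number of 2x2 patches along height
    w = width // (vae_scale_factor * 2)   # number of 2x2 patches along width
    latents = latents.view(batch, h, w, channels // 4, 2, 2)
    latents = latents.permute(0, 3, 1, 4, 2, 5)  # (batch, C, h, 2, w, 2)
    return latents.reshape(batch, channels // 4, h * 2, w * 2)

# a 1024x1024 image packs into (1024/16)^2 = 4096 patches of dimension 64
packed = torch.randn(1, 4096, 64)
print(unpack_flux_latents(packed, 1024, 1024).shape)  # torch.Size([1, 16, 128, 128])
```
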
</hfoption> <hfoption id="HunyuanVideo">
```py
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16, device_map="cuda"
)

latent = pipeline(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames

video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",
)

if isinstance(video, bytes):
    with open("video.mp4", "wb") as f:
        f.write(video)
```
</hfoption> </hfoptions>
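
Note that with `output_type="mp4"` the endpoint returns encoded bytes, while image decoding returns a PIL image. A small helper for saving either kind of output (the helper name is our own, not a library API):

```python
from PIL import Image

def save_output(output, path: str) -> str:
    # remote_decode may hand back encoded bytes (e.g. mp4) or a PIL image
    if isinstance(output, bytes):
        with open(path, "wb") as f:
            f.write(output)
    else:
        output.save(path)
    return path

# exercise both branches with stand-ins
save_output(b"\x00\x00\x00\x18ftyp", "video.mp4")
save_output(Image.new("RGB", (8, 8)), "image.png")
```
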

Queuing

Remote inference supports queuing to process multiple generation requests. While the current latent is being decoded, you can queue the next prompt.

```py
import queue
import threading

import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils.remote_utils import remote_decode
from IPython.display import display

def decode_worker(q: queue.Queue):
    while True:
        item = q.get()
        if item is None:
            break
        image = remote_decode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=item,
            scaling_factor=0.13025,
        )
        display(image)
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
thread.start()

def decode(latent: torch.Tensor):
    q.put(latent)

prompts = [
    "A grainy Apollo-era style photograph of a cat in a snug astronaut suit with a bubble helmet, standing on the lunar surface and gripping a flag with a paw-print emblem. The gray Moon landscape stretches behind it, Earth glowing vividly in the black sky, shadows crisp and high-contrast.",
    "A vintage 1960s sci-fi pulp magazine cover illustration of a heroic cat astronaut planting a flag on the Moon. Bold, saturated colors, exaggerated space gear, playful typography floating in the background, Earth painted in bright blues and greens.",
    "A hyper-detailed cinematic shot of a cat astronaut on the Moon holding a fluttering flag, fur visible through the helmet glass, lunar dust scattering under its feet. The vastness of space and Earth in the distance create an epic, awe-inspiring tone.",
    "A colorful cartoon drawing of a happy cat wearing a chunky, oversized spacesuit, proudly holding a flag with a big paw print on it. The Moon’s surface is simplified with craters drawn like doodles, and Earth in the sky has a smiling face.",
    "A monochrome 1969-style press photo of a “first cat on the Moon” moment. The cat, in a tiny astronaut suit, stands by a planted flag, with grainy textures, scratches, and a blurred Earth in the background, mimicking old archival space photos."
]

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

pipeline.unet = pipeline.unet.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

# warmup run to trigger compilation
_ = pipeline(
    prompt=prompts[0],
    output_type="latent",
)

for prompt in prompts:
    latent = pipeline(
        prompt=prompt,
        output_type="latent",
    ).images
    decode(latent)

q.put(None)
thread.join()
```
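
Stripped of the pipeline and the endpoint, the pattern above is a standard producer/consumer setup: a queue, a background worker thread, and a `None` sentinel to shut the worker down. The stub decoder below stands in for `remote_decode` so the flow can be seen in isolation:

```python
import queue
import threading

results = []

def decode_worker(q: queue.Queue):
    # consume items until the None sentinel arrives
    while True:
        item = q.get()
        if item is None:
            break
        results.append(f"decoded:{item}")  # stand-in for remote_decode + display
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
thread.start()

# the "generation loop": each latent is queued while the next one is produced
for latent in ["latent-0", "latent-1", "latent-2"]:
    q.put(latent)

q.put(None)   # sentinel stops the worker
thread.join()
print(results)  # ['decoded:latent-0', 'decoded:latent-1', 'decoded:latent-2']
```

A single worker drains the queue in FIFO order, so decoded outputs arrive in the same order the latents were generated.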

Benchmarks

The tables below show the memory requirements for encoding and decoding with Stable Diffusion v1.5 and SDXL on different GPUs.

For most of these GPUs, memory usage dictates whether other models (text encoders, UNet/transformer) need to be offloaded, or whether tiled encoding is required. Both techniques increase inference time and can impact quality.

<details><summary>Encoding - Stable Diffusion v1.5</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.015 | 3.51901 | 0.015 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.004 | 1.3154 | 0.005 | 1.3154 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.402 | 47.1852 | 0.496 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.078 | 12.2658 | 0.094 | 3.51901 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.023 | 5.30105 | 0.023 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.006 | 1.98152 | 0.006 | 1.98152 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 0.574 | 71.08 | 0.656 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.111 | 18.4772 | 0.14 | 5.30105 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.032 | 3.52782 | 0.032 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.01 | 1.31869 | 0.009 | 1.31869 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 0.742 | 47.3033 | 0.954 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.136 | 12.2965 | 0.207 | 3.52782 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.036 | 8.51761 | 0.036 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.01 | 3.18387 | 0.01 | 3.18387 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | 0.863 | 86.7424 | 1.191 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.157 | 29.6888 | 0.227 | 8.51761 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.051 | 10.6941 | 0.051 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.015 | 3.99743 | 0.015 | 3.99743 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | 1.217 | 96.054 | 1.482 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.223 | 37.2751 | 0.327 | 10.6941 |
</details>

<details><summary>Encoding - SDXL</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.029 | 4.95707 | 0.029 | 4.95707 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.007 | 2.29666 | 0.007 | 2.29666 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.873 | 66.3452 | 0.863 | 15.5649 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.142 | 15.5479 | 0.143 | 15.5479 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.044 | 7.46735 | 0.044 | 7.46735 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.01 | 3.4597 | 0.01 | 3.4597 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 1.317 | 87.1615 | 1.291 | 23.447 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.213 | 23.4215 | 0.214 | 23.4215 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.058 | 5.65638 | 0.058 | 5.65638 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.016 | 2.45081 | 0.016 | 2.45081 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 1.755 | 77.8239 | 1.614 | 18.4193 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.265 | 18.4023 | 0.265 | 18.4023 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.064 | 13.6568 | 0.064 | 13.6568 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.018 | 5.91728 | 0.018 | 5.91728 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | OOM | OOM | 1.866 | 44.4717 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.302 | 44.4308 | 0.302 | 44.4308 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.093 | 17.1465 | 0.093 | 17.1465 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.025 | 7.42931 | 0.026 | 7.42931 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | OOM | OOM | 2.674 | 55.8355 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.443 | 55.7841 | 0.443 | 55.7841 |
</details>

<details><summary>Decoding - Stable Diffusion v1.5</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.148 | 20.00% | 0.301 (+103%) | 5.60% |
| NVIDIA GeForce RTX 4080 | 512x512 | 0.05 | 8.40% | 0.050 (0%) | 8.40% |
| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.224 | 30.00% | 0.356 (+59%) | 8.40% |
| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.066 | 11.30% | 0.066 (0%) | 11.30% |
| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.284 | 40.50% | 0.454 (+60%) | 11.40% |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.062 | 5.20% | 0.062 (0%) | 5.20% |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.253 | 18.50% | 0.464 (+83%) | 5.20% |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.07 | 12.80% | 0.070 (0%) | 12.80% |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.286 | 45.30% | 0.466 (+63%) | 12.90% |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.102 | 15.90% | 0.102 (0%) | 15.90% |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.421 | 56.30% | 0.746 (+77%) | 16.00% |
</details>

<details><summary>Decoding - SDXL</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.256 | 35.50% | 0.257 (+0.4%) | 35.50% |
| NVIDIA GeForce RTX 4080 | 512x512 | 0.092 | 15.00% | 0.092 (0%) | 15.00% |
| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.406 | 53.30% | 0.406 (0%) | 53.30% |
| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.121 | 20.20% | 0.120 (-0.8%) | 20.20% |
| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.519 | 72.00% | 0.519 (0%) | 72.00% |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.107 | 10.50% | 0.107 (0%) | 10.50% |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.459 | 38.00% | 0.460 (+0.2%) | 38.00% |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.121 | 25.60% | 0.121 (0%) | 25.60% |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.524 | 93.00% | 0.524 (0%) | 93.00% |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.183 | 31.80% | 0.183 (0%) | 31.80% |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.794 | 96.40% | 0.794 (0%) | 96.40% |

</details>

Resources