<!--Copyright 2025 The HuggingFace Team. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

Remote inference

> [!TIP]
> This is currently an experimental feature. If you have any feedback, please feel free to leave it here.

Remote inference offloads the encoding and decoding process to a remote endpoint to relax the memory requirements of local inference with large models. This feature is powered by Inference Endpoints. Refer to the table below for the supported models and endpoints.

| Model | Endpoint | Checkpoint | Support |
|---|---|---|---|
| Stable Diffusion v1 | https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud | stabilityai/sd-vae-ft-mse | encode/decode |
| Stable Diffusion XL | https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud | madebyollin/sdxl-vae-fp16-fix | encode/decode |
| Flux | https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud | black-forest-labs/FLUX.1-schnell | encode/decode |
| HunyuanVideo | https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud | hunyuanvideo-community/HunyuanVideo | decode |

This guide will show you how to encode and decode latents with remote inference.

Encoding

Encoding converts images and videos into latent representations. Refer to the table above for the supported VAEs.

Pass an image to [`~utils.remote_encode`] to encode it. The specific `scaling_factor` and `shift_factor` values for each model can be found in the Remote inference API reference.

```py
import torch
from diffusers import FluxPipeline
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_encode

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
init_image = init_image.resize((768, 512))

init_latent = remote_encode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud",
    image=init_image,
    scaling_factor=0.3611,
    shift_factor=0.1159
)
```
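
The `scaling_factor` and `shift_factor` follow the usual Diffusers VAE convention: encoded latents are shifted and scaled so that downstream pipelines see roughly unit-scale inputs, and decoding inverts the transform. A minimal sketch of that convention (the helper names are our own, not a library API):

```python
import torch

# Flux VAE values, matching the remote_encode call above
scaling_factor, shift_factor = 0.3611, 0.1159

def normalize(raw_latent: torch.Tensor) -> torch.Tensor:
    # Diffusers-style normalization applied after VAE encoding
    return (raw_latent - shift_factor) * scaling_factor

def denormalize(latent: torch.Tensor) -> torch.Tensor:
    # the inverse, applied before VAE decoding
    return latent / scaling_factor + shift_factor

x = torch.randn(1, 16, 64, 64)
# the two transforms round-trip exactly (up to float precision)
assert torch.allclose(denormalize(normalize(x)), x, atol=1e-5)
```

This is why the same `scaling_factor`/`shift_factor` pair must be passed to both `remote_encode` and `remote_decode` for a given model.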

Decoding

Decoding converts latent representations back into images or videos. Refer to the table above for the supported VAEs.

Set `output_type="latent"` in the pipeline and set the `vae` to `None`. Pass the latents to the [`~utils.remote_decode`] function. For Flux, the latents are packed, so the `height` and `width` also need to be passed. The specific `scaling_factor` and `shift_factor` values for each model can be found in the Remote inference API reference.

<hfoptions id="decode"> <hfoption id="Flux">
```py
import torch
from diffusers import FluxPipeline
from diffusers.utils.remote_utils import remote_decode

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,
    device_map="cuda"
)

prompt = """
A photorealistic Apollo-era photograph of a cat in a small astronaut suit with a bubble helmet, standing on the Moon and holding a flagpole planted in the dusty lunar soil. The flag shows a colorful paw-print emblem. Earth glows in the black sky above the stark gray surface, with sharp shadows and high-contrast lighting like vintage NASA photos.
"""

latent = pipeline(
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=4,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,
    width=1024,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
image.save("image.jpg")
```
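
`height` and `width` are required here because Flux latents arrive packed as a `(batch, num_patches, 64)` sequence of 2x2 patches, so the spatial shape cannot be inferred from the tensor alone. A rough sketch of the unpacking (it mirrors the idea behind the pipeline's internal unpacking, not Diffusers' exact implementation):

```python
import torch

def unpack_flux_latents(latents: torch.Tensor, height: int, width: int, vae_scale_factor: int = 8) -> torch.Tensor:
    """Rearrange packed (batch, num_patches, 64) Flux latents into (batch, 16, H/8, W/8)."""
    batch, num_patches, channels = latents.shape
    h = height // (vae_scale_factor * 2)  # number of 2x2 patches along height
    w = width // (vae_scale_factor * 2)   # number of 2x2 patches along width
    latents = latents.view(batch, h, w, channels // 4, 2, 2)
    latents = latents.permute(0, 3, 1, 4, 2, 5)  # (batch, C, h, 2, w, 2)
    return latents.reshape(batch, channels // 4, h * 2, w * 2)

# a 1024x1024 image packs into (1024/16)^2 = 4096 patches of dimension 64
packed = torch.randn(1, 4096, 64)
print(unpack_flux_latents(packed, 1024, 1024).shape)  # torch.Size([1, 16, 128, 128])
```
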
</hfoption> <hfoption id="HunyuanVideo">
```py
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16, device_map="cuda"
)

latent = pipeline(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames

video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",
)

if isinstance(video, bytes):
    with open("video.mp4", "wb") as f:
        f.write(video)
```
</hfoption> </hfoptions>
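
Note that with `output_type="mp4"` the endpoint returns encoded bytes, while image decoding returns a PIL image. A small helper for saving either kind of output (the helper name is our own, not a library API):

```python
from PIL import Image

def save_output(output, path: str) -> str:
    # remote_decode may hand back encoded bytes (e.g. mp4) or a PIL image
    if isinstance(output, bytes):
        with open(path, "wb") as f:
            f.write(output)
    else:
        output.save(path)
    return path

# exercise both branches with stand-ins
save_output(b"\x00\x00\x00\x18ftyp", "video.mp4")
save_output(Image.new("RGB", (8, 8)), "image.png")
```
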

Queuing

Remote inference supports queuing to process multiple generation requests. While the current latent is being decoded, you can queue the next prompt.

```py
import queue
import threading

import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils.remote_utils import remote_decode
from IPython.display import display

def decode_worker(q: queue.Queue):
    while True:
        item = q.get()
        if item is None:
            break
        image = remote_decode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=item,
            scaling_factor=0.13025,
        )
        display(image)
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
thread.start()

def decode(latent: torch.Tensor):
    q.put(latent)

prompts = [
    "A grainy Apollo-era style photograph of a cat in a snug astronaut suit with a bubble helmet, standing on the lunar surface and gripping a flag with a paw-print emblem. The gray Moon landscape stretches behind it, Earth glowing vividly in the black sky, shadows crisp and high-contrast.",
    "A vintage 1960s sci-fi pulp magazine cover illustration of a heroic cat astronaut planting a flag on the Moon. Bold, saturated colors, exaggerated space gear, playful typography floating in the background, Earth painted in bright blues and greens.",
    "A hyper-detailed cinematic shot of a cat astronaut on the Moon holding a fluttering flag, fur visible through the helmet glass, lunar dust scattering under its feet. The vastness of space and Earth in the distance create an epic, awe-inspiring tone.",
    "A colorful cartoon drawing of a happy cat wearing a chunky, oversized spacesuit, proudly holding a flag with a big paw print on it. The Moon’s surface is simplified with craters drawn like doodles, and Earth in the sky has a smiling face.",
    "A monochrome 1969-style press photo of a “first cat on the Moon” moment. The cat, in a tiny astronaut suit, stands by a planted flag, with grainy textures, scratches, and a blurred Earth in the background, mimicking old archival space photos."
]

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

pipeline.unet = pipeline.unet.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

# warmup run to trigger compilation
_ = pipeline(
    prompt=prompts[0],
    output_type="latent",
)

for prompt in prompts:
    latent = pipeline(
        prompt=prompt,
        output_type="latent",
    ).images
    decode(latent)

q.put(None)
thread.join()
```
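
Stripped of the pipeline and the endpoint, the pattern above is a standard producer/consumer setup: a queue, a background worker thread, and a `None` sentinel to shut the worker down. The stub decoder below stands in for `remote_decode` so the flow can be seen in isolation:

```python
import queue
import threading

results = []

def decode_worker(q: queue.Queue):
    # consume items until the None sentinel arrives
    while True:
        item = q.get()
        if item is None:
            break
        results.append(f"decoded:{item}")  # stand-in for remote_decode + display
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
thread.start()

# the "generation loop": each latent is queued while the next one is produced
for latent in ["latent-0", "latent-1", "latent-2"]:
    q.put(latent)

q.put(None)   # sentinel stops the worker
thread.join()
print(results)  # ['decoded:latent-0', 'decoded:latent-1', 'decoded:latent-2']
```

A single worker drains the queue in FIFO order, so decoded outputs arrive in the same order the latents were generated.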

Benchmarks

The tables below show the memory requirements for encoding and decoding with Stable Diffusion v1.5 and SDXL on different GPUs.

For most of these GPUs, memory usage dictates whether other models (text encoders, UNet/transformer) need to be offloaded, or whether tiled encoding is required. Both techniques increase inference time and can impact quality.

<details><summary>Encoding - Stable Diffusion v1.5</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.015 | 3.51901 | 0.015 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.004 | 1.3154 | 0.005 | 1.3154 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.402 | 47.1852 | 0.496 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.078 | 12.2658 | 0.094 | 3.51901 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.023 | 5.30105 | 0.023 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.006 | 1.98152 | 0.006 | 1.98152 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 0.574 | 71.08 | 0.656 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.111 | 18.4772 | 0.14 | 5.30105 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.032 | 3.52782 | 0.032 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.01 | 1.31869 | 0.009 | 1.31869 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 0.742 | 47.3033 | 0.954 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.136 | 12.2965 | 0.207 | 3.52782 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.036 | 8.51761 | 0.036 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.01 | 3.18387 | 0.01 | 3.18387 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | 0.863 | 86.7424 | 1.191 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.157 | 29.6888 | 0.227 | 8.51761 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.051 | 10.6941 | 0.051 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.015 | 3.99743 | 0.015 | 3.99743 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | 1.217 | 96.054 | 1.482 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.223 | 37.2751 | 0.327 | 10.6941 |
</details>

<details><summary>Encoding - SDXL</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.029 | 4.95707 | 0.029 | 4.95707 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.007 | 2.29666 | 0.007 | 2.29666 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.873 | 66.3452 | 0.863 | 15.5649 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.142 | 15.5479 | 0.143 | 15.5479 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.044 | 7.46735 | 0.044 | 7.46735 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.01 | 3.4597 | 0.01 | 3.4597 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 1.317 | 87.1615 | 1.291 | 23.447 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.213 | 23.4215 | 0.214 | 23.4215 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.058 | 5.65638 | 0.058 | 5.65638 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.016 | 2.45081 | 0.016 | 2.45081 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 1.755 | 77.8239 | 1.614 | 18.4193 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.265 | 18.4023 | 0.265 | 18.4023 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.064 | 13.6568 | 0.064 | 13.6568 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.018 | 5.91728 | 0.018 | 5.91728 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | OOM | OOM | 1.866 | 44.4717 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.302 | 44.4308 | 0.302 | 44.4308 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.093 | 17.1465 | 0.093 | 17.1465 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.025 | 7.42931 | 0.026 | 7.42931 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | OOM | OOM | 2.674 | 55.8355 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.443 | 55.7841 | 0.443 | 55.7841 |
</details>

<details><summary>Decoding - Stable Diffusion v1.5</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.148 | 20.00% | 0.301 (+103%) | 5.60% |
| NVIDIA GeForce RTX 4080 | 512x512 | 0.05 | 8.40% | 0.050 (0%) | 8.40% |
| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.224 | 30.00% | 0.356 (+59%) | 8.40% |
| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.066 | 11.30% | 0.066 (0%) | 11.30% |
| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.284 | 40.50% | 0.454 (+60%) | 11.40% |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.062 | 5.20% | 0.062 (0%) | 5.20% |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.253 | 18.50% | 0.464 (+83%) | 5.20% |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.07 | 12.80% | 0.070 (0%) | 12.80% |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.286 | 45.30% | 0.466 (+63%) | 12.90% |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.102 | 15.90% | 0.102 (0%) | 15.90% |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.421 | 56.30% | 0.746 (+77%) | 16.00% |
</details>

<details><summary>Decoding - SDXL</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.256 | 35.50% | 0.257 (+0.4%) | 35.50% |
| NVIDIA GeForce RTX 4080 | 512x512 | 0.092 | 15.00% | 0.092 (0%) | 15.00% |
| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.406 | 53.30% | 0.406 (0%) | 53.30% |
| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.121 | 20.20% | 0.120 (-0.8%) | 20.20% |
| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.519 | 72.00% | 0.519 (0%) | 72.00% |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.107 | 10.50% | 0.107 (0%) | 10.50% |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.459 | 38.00% | 0.460 (+0.2%) | 38.00% |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.121 | 25.60% | 0.121 (0%) | 25.60% |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.524 | 93.00% | 0.524 (0%) | 93.00% |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.183 | 31.80% | 0.183 (0%) | 31.80% |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.794 | 96.40% | 0.794 (0%) | 96.40% |

</details>

Resources