SDXL Turbo
==========
Stable Diffusion XL Turbo (SDXL Turbo) is a distilled version of SDXL 1.0 and is capable of creating images in a single step, with improved real-time text-to-image output quality and sampling fidelity.
This document demonstrates how to serve SDXL Turbo with BentoML.
.. raw:: html

    <div style="display: flex; justify-content: space-between; margin-bottom: 20px;">
        <div style="border: 1px solid #ccc; padding: 10px; border-radius: 10px; background-color: #f9f9f9; flex-grow: 1; margin-right: 10px; text-align: center;">
            <a href="https://github.com/bentoml/BentoDiffusion" style="margin-left: 5px; vertical-align: middle;">Source Code</a>
        </div>
        <div style="border: 1px solid #ccc; padding: 10px; border-radius: 10px; background-color: #f9f9f9; flex-grow: 1; margin-left: 10px; text-align: center;">
            <a href="#bentocloud" style="margin-left: 5px; vertical-align: middle;">Deploy to BentoCloud</a>
        </div>
        <div style="border: 1px solid #ccc; padding: 10px; border-radius: 10px; background-color: #f9f9f9; flex-grow: 1; margin-left: 10px; text-align: center;">
            <a href="#localserving" style="margin-left: 5px; vertical-align: middle;">Serve with BentoML</a>
        </div>
    </div>
The resulting inference API accepts custom parameters for image generation. For example, you can send a query containing the following:
.. code-block:: json

    {
        "guidance_scale": 0,
        "num_inference_steps": 1,
        "prompt": "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
    }
Example output:
.. image:: ../../_static/img/examples/sdxl-turbo/output-image.png
    :align: center
    :alt: Generated image of a baby raccoon wearing an Italian priest robe, created by SDXL Turbo based on the example prompt
This example is ready for quick deployment and scaling on BentoCloud. With a single command, you get a production-grade application with fast autoscaling, secure deployment in your cloud, and comprehensive observability.
.. image:: ../../_static/img/examples/sdxl-turbo/sdxl-turbo-bentocloud.png
    :alt: Screenshot of SDXL Turbo deployed on BentoCloud showing the image generation interface with prompt input and parameter controls
You can find `the source code in GitHub <https://github.com/bentoml/BentoDiffusion/tree/main/sdxl-turbo>`_. Below is a breakdown of the key code implementations within this project.
Define the SDXL Turbo model ID. You can switch to any other diffusion model as needed.
.. code-block:: python
    :caption: service.py

    MODEL_ID = "stabilityai/sdxl-turbo"
Use the ``@bentoml.service`` decorator to define a BentoML Service, where you can customize how the model will be served. The decorator lets you set :doc:`configurations </reference/bentoml/configurations>` like timeout and the GPU resources to use on BentoCloud. Note that SDXL Turbo requires at least an NVIDIA L4 GPU for optimal performance.
.. code-block:: python
    :caption: service.py

    @bentoml.service(
        traffic={"timeout": 300},
        resources={
            "gpu": 1,
            "gpu_type": "nvidia-l4",
        },
    )
    class SDXLTurbo:
        model_path = bentoml.models.HuggingFaceModel(MODEL_ID)
        ...
Within the class, :ref:`load the model from Hugging Face <load-models>` and define it as a class variable. The ``HuggingFaceModel`` method provides an efficient mechanism for loading AI models to accelerate model deployment on BentoCloud, reducing image build time and cold start time.
The ``@bentoml.service`` decorator also allows you to :doc:`define the runtime environment </build-with-bentoml/runtime-environment>` for a Bento, the unified distribution format in BentoML. A Bento is packaged with all the source code, Python dependencies, model references, and environment setup, making it easy to deploy consistently across different environments.
Here is an example:
.. code-block:: python
    :caption: service.py

    my_image = bentoml.images.Image(python_version="3.11") \
        .requirements_file("requirements.txt")

    @bentoml.service(
        image=my_image,  # Apply the specifications
        ...
    )
    class SDXLTurbo:
        ...
Use the ``@bentoml.api`` decorator to define an API endpoint for image generation inference. The ``txt2img`` method is an endpoint that takes a text prompt, number of inference steps, and a guidance scale as inputs. It uses the model pipeline to generate an image based on the given prompt and parameters.
.. code-block:: python
    :caption: service.py

    from typing import Annotated

    import bentoml
    from annotated_types import Ge, Le
    from PIL.Image import Image

    sample_prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."

    class SDXLTurbo:
        model_path = bentoml.models.HuggingFaceModel(MODEL_ID)

        def __init__(self) -> None:
            from diffusers import AutoPipelineForText2Image
            import torch

            # Load the model
            self.pipe = AutoPipelineForText2Image.from_pretrained(
                self.model_path,
                torch_dtype=torch.float16,
                variant="fp16",
            )

            # Move the pipeline to GPU
            self.pipe.to(device="cuda")

        @bentoml.api
        def txt2img(
            self,
            prompt: str = sample_prompt,
            num_inference_steps: Annotated[int, Ge(1), Le(10)] = 1,
            guidance_scale: float = 0.0,
        ) -> Image:
            image = self.pipe(
                prompt=prompt,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
            ).images[0]
            return image
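The ``Annotated[int, Ge(1), Le(10)]`` hint on ``num_inference_steps`` is what constrains the accepted step count to the range 1-10. As a rough standalone illustration of how such ``annotated-types`` constraints behave, the sketch below validates values with pydantic's ``TypeAdapter`` (using pydantic directly here is an assumption for demonstration; it is not how you would normally interact with the Service):

```python
from typing import Annotated

from annotated_types import Ge, Le
from pydantic import TypeAdapter, ValidationError

# Same constraint as the txt2img endpoint: an integer between 1 and 10
StepCount = Annotated[int, Ge(1), Le(10)]
steps = TypeAdapter(StepCount)

print(steps.validate_python(1))  # within bounds, accepted

try:
    steps.validate_python(0)  # below Ge(1), rejected
except ValidationError:
    print("0 rejected")
```

A request with an out-of-range ``num_inference_steps`` is rejected with a validation error before inference runs.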
You can run `this example project <https://github.com/bentoml/BentoDiffusion/tree/main/sdxl-turbo>`_ on BentoCloud, or serve it locally, containerize it as an OCI-compliant image, and deploy it anywhere.
.. _BentoCloud:

BentoCloud
^^^^^^^^^^

.. raw:: html

    <a id="bentocloud"></a>
BentoCloud provides fast and scalable infrastructure for building and scaling AI applications with BentoML in the cloud.
Install BentoML and :doc:`log in to BentoCloud </scale-with-bentocloud/manage-api-tokens>` through the BentoML CLI. If you don't have a BentoCloud account, `sign up here for free <https://www.bentoml.com/>`_.
.. code-block:: bash

    pip install bentoml
    bentoml cloud login
Clone the `BentoDiffusion repository <https://github.com/bentoml/BentoDiffusion>`_ and deploy the project.
.. code-block:: bash

    git clone https://github.com/bentoml/BentoDiffusion.git
    cd BentoDiffusion/sdxl-turbo
    bentoml deploy
Once it is up and running on BentoCloud, you can call the endpoint in the following ways:
.. tab-set::

    .. tab-item:: BentoCloud Playground

        .. image:: ../../_static/img/examples/sdxl-turbo/sdxl-turbo-bentocloud.png
            :alt: Screenshot of SDXL Turbo deployed on BentoCloud showing the image generation interface with prompt input and parameter controls

    .. tab-item:: Python client

        Create a :doc:`BentoML client </build-with-bentoml/clients>` to call the endpoint. Make sure you replace the Deployment URL with your own on BentoCloud. Refer to :ref:`scale-with-bentocloud/deployment/call-deployment-endpoints:obtain the endpoint url` for details.

        .. code-block:: python

            import bentoml
            from pathlib import Path

            # Define the path to save the generated image
            output_path = Path("generated_image.png")

            with bentoml.SyncHTTPClient("https://sdxl-turbo-nmsx-e3c1c7db.mt-guc1.bentoml.ai") as client:
                result = client.txt2img(
                    guidance_scale=0,
                    num_inference_steps=1,
                    prompt="A cinematic shot of a baby racoon wearing an intricate italian priest robe.",
                )

            # The result should be a PIL.Image object
            result.save(output_path)
            print(f"Image saved at {output_path}")

    .. tab-item:: CURL

        Make sure you replace the Deployment URL with your own on BentoCloud. Refer to :ref:`scale-with-bentocloud/deployment/call-deployment-endpoints:obtain the endpoint url` for details.

        .. code-block:: bash

            curl -s -X POST \
                'https://sdxl-turbo-nmsx-e3c1c7db.mt-guc1.bentoml.ai/txt2img' \
                -H 'Content-Type: application/json' \
                -d '{
                    "guidance_scale": 0,
                    "num_inference_steps": 1,
                    "prompt": "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
                }' \
                -o output.jpg
.. note::

    SDXL Turbo is capable of performing inference in a single step, so setting ``num_inference_steps`` to ``1`` is typically sufficient for generating high-quality images. Additionally, set ``guidance_scale`` to ``0`` to deactivate it, as the model was trained without it. See the `official release notes <https://github.com/huggingface/diffusers/releases/tag/v0.24.0>`_ to learn more.
To make sure the Deployment automatically scales within a certain replica range, add the scaling flags:
.. code-block:: bash

    bentoml deploy --scaling-min 0 --scaling-max 3 # Set your desired count
If it's already deployed, update its allowed replicas as follows:
.. code-block:: bash

    bentoml deployment update <deployment-name> --scaling-min 0 --scaling-max 3 # Set your desired count
For more information, see :doc:`how to configure concurrency and autoscaling </scale-with-bentocloud/scaling/autoscaling>`.
.. _LocalServing:

Local serving
^^^^^^^^^^^^^

.. raw:: html

    <a id="localserving"></a>
BentoML allows you to run and test your code locally, so that you can quickly validate your code with local compute resources.
Clone the repository and choose your desired project.
.. code-block:: bash

    git clone https://github.com/bentoml/BentoDiffusion.git
    cd BentoDiffusion/sdxl-turbo
    pip install -r requirements.txt
Serve it locally.
.. code-block:: bash

    bentoml serve
.. note::

    To run this project with SDXL Turbo, you need an NVIDIA GPU with at least 12G VRAM.
Visit or send API requests to `http://localhost:3000 <http://localhost:3000/>`_.
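For a quick smoke test against the local server, the same payload shown earlier can be sent to the ``txt2img`` endpoint. This is a minimal sketch assuming the server is running on the default port 3000; the output filename is just an example:

```shell
curl -s -X POST \
    'http://localhost:3000/txt2img' \
    -H 'Content-Type: application/json' \
    -d '{
        "guidance_scale": 0,
        "num_inference_steps": 1,
        "prompt": "A cinematic shot of a baby racoon wearing an intricate italian priest robe."
    }' \
    -o output.png
```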
For custom deployment in your own infrastructure, use BentoML to :doc:`generate an OCI-compliant image </get-started/packaging-for-deployment>`.
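As a rough sketch of that flow (the ``sdxl_turbo:latest`` tag below is a placeholder; use the actual Bento tag that ``bentoml build`` prints):

```shell
# Build the Bento from the project directory
bentoml build

# Containerize the Bento as an OCI-compliant image
# (replace sdxl_turbo:latest with the tag printed by the build step)
bentoml containerize sdxl_turbo:latest

# Run the container locally with GPU access
docker run --rm --gpus all -p 3000:3000 sdxl_turbo:latest
```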