(fractional-gpu-guide)=
# Fractional GPU allocation

Serve multiple small models on the same GPU for cost-efficient deployments.
:::{note}
This feature hasn't been extensively tested in production. If you encounter any issues, report them on GitHub with reproducible code.
:::
Fractional GPU allocation allows you to run multiple model replicas on a single GPU by customizing placement groups. This approach maximizes GPU utilization and reduces costs when serving small models that don't require a full GPU's resources.
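Conceptually, each replica reserves a fractional GPU bundle in a Ray placement group. The following is a minimal, illustrative sketch of that underlying mechanism using only Ray core APIs; you don't need to write this yourself, because Ray Serve LLM builds the placement group from the `placement_group_config` shown later in this guide.

```python
# Illustrative only: Ray core scheduling of a fractional GPU share.
# Ray Serve LLM constructs an equivalent placement group for each replica
# when you set placement_group_config on an LLMConfig.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve roughly half of one GPU as a bundle; two such bundles fit on one GPU.
pg = placement_group([{"GPU": 0.49}])
ray.get(pg.ready())


@ray.remote(num_gpus=0.49)
def report_device() -> str:
    # Ray sets CUDA_VISIBLE_DEVICES to the device shared by this worker.
    import os

    return os.environ.get("CUDA_VISIBLE_DEVICES", "")


strategy = PlacementGroupSchedulingStrategy(placement_group=pg)
print(ray.get(report_device.options(scheduling_strategy=strategy).remote()))
```

Two bundles of `GPU=0.49` fit on a single GPU, which is how the example below packs two replicas onto each L4.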
Consider fractional GPU allocation when:

- Your models are small enough that a single replica doesn't need a full GPU's memory or compute.
- You want to run more replicas on the same hardware to increase GPU utilization.
- You want to reduce serving costs by packing multiple replicas onto each GPU.
The following example shows how to serve 8 replicas of a small model on 4 L4 GPUs (2 replicas per GPU):
```python
from ray import serve
from ray.serve.llm import LLMConfig, ModelLoadingConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    ),
    engine_kwargs=dict(
        # vLLM targets 40% of the GPU's memory for this replica.
        gpu_memory_utilization=0.4,
        use_tqdm_on_load=False,
        # Disable CUDA graphs to reduce per-replica memory overhead.
        enforce_eager=True,
        max_model_len=2048,
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=8,
            max_replicas=8,
        )
    ),
    accelerator_type="L4",
    # Reserve just under half a GPU per replica to leave headroom for overhead.
    placement_group_config=dict(bundles=[dict(GPU=0.49)]),
    runtime_env=dict(
        env_vars={
            # Avoid torch compile cache contention between co-located workers.
            "VLLM_DISABLE_COMPILE_CACHE": "1",
        },
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```
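After the deployment is up, you can send requests to its OpenAI-compatible API from another terminal or process. The following sketch assumes the default Serve HTTP address (`http://localhost:8000`) and uses the `openai` Python client; adjust the base URL and model ID to match your configuration.

```python
from openai import OpenAI

# Point the client at the local Serve endpoint. The API key isn't validated by
# default, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolVLM-256M-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```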
## Configuration parameters

Use the following parameters to configure fractional GPU allocation. The placement group defines the GPU share, and Ray Serve infers the matching `VLLM_RAY_PER_WORKER_GPUS` value for you. The memory management and performance settings are vLLM-specific optimizations that you can adjust based on your model and workload requirements.
- `placement_group_config`: Specifies the GPU fraction each replica uses. Set `GPU` to the fraction (for example, `0.49` for approximately half a GPU). Use slightly less than the theoretical fraction to account for system overhead; this headroom prevents out-of-memory errors.
- `VLLM_RAY_PER_WORKER_GPUS`: Ray Serve derives this from `placement_group_config` when GPU bundles are fractional. Setting it manually is allowed but not recommended.
- `gpu_memory_utilization`: Controls how much GPU memory vLLM pre-allocates. vLLM allocates memory based on this setting regardless of Ray's GPU scheduling. In the example, `0.4` means vLLM targets 40% of GPU memory for the model weights, KV cache, and CUDA graph memory.
- `enforce_eager`: Set to `True` to disable CUDA graphs and reduce memory overhead.
- `max_model_len`: Limits the maximum sequence length, reducing memory requirements.
- `use_tqdm_on_load`: Set to `False` to disable progress bars during model loading.
- `VLLM_DISABLE_COMPILE_CACHE`: Set to `1` to avoid a resource contention issue among workers during torch compile caching.

## Best practices

- Request slightly less than the theoretical GPU fraction (for example, `0.49` instead of `0.5`) to account for system overhead.
- Make sure that `gpu_memory_utilization` × GPU memory × number of replicas per GPU doesn't exceed total GPU memory. For example, on a 24 GB L4, two replicas at `gpu_memory_utilization=0.4` plan for 0.4 × 24 GB × 2 = 19.2 GB, which fits.
- Check how much memory your model actually needs before settling on `gpu_memory_utilization`. This information often gets printed as part of the vLLM initialization.

## Troubleshooting

If replicas fail with out-of-memory errors:

- Reduce `gpu_memory_utilization` (for example, from `0.4` to `0.3`).
- Reduce `max_model_len` to reduce KV cache size.
- Set `enforce_eager=True` if not already set to ensure CUDA graph memory requirements don't cause issues.
- Reduce the number of replicas per GPU (the example packs 2 replicas with `GPU=0.49` each).

If replicas don't share GPUs as expected:

- Verify that the GPU fraction in `placement_group_config` matches the share you expect Ray to reserve.
- If you set `VLLM_RAY_PER_WORKER_GPUS` manually (not recommended), ensure it matches the GPU share from the placement group.
- Verify that `VLLM_DISABLE_COMPILE_CACHE=1` is set to avoid torch compile caching conflicts.

## Related guides

- [Quickstart](../quick-start): Basic LLM deployment examples