doc/source/serve/tutorials/deployment-serve-llm/content/README.md
These guides provide a fast path to serving LLMs using Ray Serve on Anyscale, with focused tutorials for different deployment scales, from single-GPU setups to multi-node clusters.
Each tutorial includes development and production setups, tips for configuring your cluster, and guidance on monitoring and scaling with Ray Serve.
Ray Serve LLM provides production-grade features beyond what standalone vLLM offers:
- **Horizontal scaling**: Replicate your model across multiple GPUs or nodes and automatically balance traffic across replicas. As request volume grows, Ray Serve automatically adds more replicas to handle the load.
- **Production readiness**: Ray Serve provides built-in autoscaling, fault tolerance, rolling updates, and comprehensive monitoring through Grafana dashboards. The system handles replica failures gracefully and scales based on traffic patterns.
- **Multi-model serving**: Deploy multiple models with different configurations on the same cluster. Each model can have its own autoscaling policy and resource requirements.
- **Modular architecture**: Separate your application logic from infrastructure concerns. You can customize request routing, add authentication layers, or integrate with existing systems without modifying your model serving code.
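The autoscaling behavior described above is configured per deployment. A minimal sketch of such a policy, assuming Ray Serve's `autoscaling_config` field names (which can vary across Ray versions):

```python
# Sketch of a per-deployment autoscaling policy. The field names follow
# Ray Serve's autoscaling_config; treat exact names and defaults as
# version-dependent.
autoscaling_config = {
    "min_replicas": 1,              # keep one replica warm for latency
    "max_replicas": 4,              # cap GPU spend under bursty traffic
    "target_ongoing_requests": 16,  # add replicas when per-replica load exceeds this
}
```

Each model deployed on the cluster can carry its own copy of this policy, which is how multi-model serving keeps scaling decisions independent per model.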
For simple single-GPU deployments or experimentation, standalone vLLM might be sufficient. However, for production workloads that need to scale, handle failures, or serve multiple models efficiently, Ray Serve provides the infrastructure you need.
Ray Serve LLM is built on two main components that work together to serve your model:
- **`LLMServer`**: A Ray Serve deployment that manages a vLLM engine instance. Each replica of this deployment wraps its own engine.
- **`OpenAiIngress`**: A FastAPI-based ingress deployment that exposes the OpenAI-compatible HTTP endpoints (`/v1/chat/completions`, etc.).

When you call `build_openai_app`, Ray Serve LLM creates both components and connects them automatically. The ingress receives HTTP requests and forwards them to available `LLMServer` replicas through deployment handles.
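A minimal deployment sketch, assuming Ray Serve LLM's `LLMConfig` and `build_openai_app` API; the model ID, Hugging Face source, and engine settings below are placeholders, and the exact config fields can differ across Ray versions:

```python
# Sketch of a minimal Ray Serve LLM deployment. The model ID and model
# source are example placeholders; field names follow the Ray Serve LLM
# docs and may differ across Ray versions.

def make_llm_config():
    # Plain-dict arguments for LLMConfig.
    return {
        "model_loading_config": {
            "model_id": "my-llama",  # name clients use in requests
            "model_source": "meta-llama/Llama-3.1-8B-Instruct",  # assumed HF repo
        },
        "deployment_config": {
            "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
        },
        "engine_kwargs": {"max_model_len": 8192},
    }

if __name__ == "__main__":
    # Requires a Ray cluster with GPUs; not runnable without one.
    from ray import serve
    from ray.serve.llm import LLMConfig, build_openai_app

    app = build_openai_app({"llm_configs": [LLMConfig(**make_llm_config())]})
    serve.run(app, blocking=True)
```

`build_openai_app` returns the combined ingress-plus-`LLMServer` application, so a single `serve.run` call starts both components wired together.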
For detailed technical information, including diagrams of request flow and placement strategies, see the Architecture overview.
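Because the ingress speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can query it. A standard-library-only sketch, where the URL and model name are placeholders for your own deployment:

```python
# Sketch of querying the OpenAI-compatible endpoint that the ingress
# exposes, using only the standard library. URL and model name are
# placeholders.
import json
from urllib import request

def chat_request_body(model: str, prompt: str) -> bytes:
    # JSON payload in the OpenAI chat-completions format.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

if __name__ == "__main__":
    # Assumes a Ray Serve LLM app running locally on the default port.
    req = request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=chat_request_body("my-llama", "Hello!"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```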
- **Deploy a small-sized LLM**: Deploy small-sized models on a single GPU, such as Llama 3 8B, Mistral 7B, or Phi-2.
- **Deploy a medium-sized LLM**: Deploy medium-sized models using tensor parallelism across 4-8 GPUs on a single node, such as Llama 3 70B, Qwen 14B, or Mixtral 8x7B.
- **Deploy a large-sized LLM**: Deploy massive models using pipeline parallelism across a multi-node cluster, such as DeepSeek-R1 or Llama-Nemotron-253B.
- **Deploy a vision LLM**: Deploy models with image and text input, such as Qwen2.5-VL-7B-Instruct, MiniGPT-4, or Pixtral-12B.
- **Deploy a reasoning LLM**: Deploy models with reasoning capabilities designed for long-context tasks, coding, or tool use, such as QwQ-32B.
- **Deploy a hybrid reasoning LLM**: Deploy models that can switch between reasoning and non-reasoning modes for flexible usage, such as Qwen3.
- **Deploy gpt-oss**: Deploy gpt-oss reasoning models for production-scale workloads, covering lower-latency (gpt-oss-20b) and high-reasoning (gpt-oss-120b) use cases.
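The tensor- and pipeline-parallel settings mentioned in the medium and large tutorials are passed to the engine through `engine_kwargs`. A sketch, assuming vLLM's `tensor_parallel_size` and `pipeline_parallel_size` arguments; the GPU counts are illustrative:

```python
# Sketch of engine_kwargs for the medium- and large-model cases,
# assuming vLLM's tensor_parallel_size / pipeline_parallel_size
# arguments. GPU counts are illustrative.

# Medium model: shard each layer across 8 GPUs on one node.
medium_engine_kwargs = {"tensor_parallel_size": 8}

# Large model: tensor parallelism within each node combined with
# pipeline parallelism across 2 nodes (16 GPUs total per replica).
large_engine_kwargs = {
    "tensor_parallel_size": 8,
    "pipeline_parallel_size": 2,
}

def gpus_per_replica(kwargs: dict) -> int:
    # Each replica uses tensor_parallel_size * pipeline_parallel_size GPUs.
    return (kwargs.get("tensor_parallel_size", 1)
            * kwargs.get("pipeline_parallel_size", 1))
```

Tensor parallelism splits individual layers across GPUs on one node, while pipeline parallelism assigns contiguous groups of layers to different nodes, which is why the large-model tutorial combines both.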