skills/mlops/inference/vllm/references/optimization.md
Traditional attention problem:
PagedAttention solution:
Memory savings example:
Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
Configuration:
# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16
# Number of GPU blocks (auto-calculated)
# Controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9
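To make block-based allocation concrete, the following is a minimal bookkeeping sketch, not vLLM's actual allocator. It compares reserving a contiguous KV region sized for the maximum sequence length against allocating fixed-size blocks on demand. The function names, the three sequence lengths, and the 4096-token maximum are illustrative assumptions; the block size of 16 matches the default above.

```python
import math

def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size KV blocks required to hold num_tokens."""
    return math.ceil(num_tokens / block_size)

def wasted_slots_paged(num_tokens: int, block_size: int = 16) -> int:
    """Unused token slots with on-demand blocks: only the last block can be partly empty."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens

def wasted_slots_contiguous(num_tokens: int, max_len: int) -> int:
    """Unused token slots when max_len slots are reserved up front for a sequence."""
    return max_len - num_tokens

# Hypothetical workload: three sequences, each allowed up to 4096 tokens.
seq_lens = [100, 750, 2000]
paged_waste = sum(wasted_slots_paged(n) for n in seq_lens)
contiguous_waste = sum(wasted_slots_contiguous(n, 4096) for n in seq_lens)
print(f"paged waste: {paged_waste} slots, contiguous waste: {contiguous_waste} slots")
```

With paging, waste is bounded by one partially filled block per sequence, which is why a smaller `--block-size` reduces internal fragmentation at the cost of more blocks to track.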
Traditional batching:
Continuous batching:
Throughput improvement:
Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching: 200 req/sec @ 90% GPU util
= 4x throughput improvement
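The difference between the two scheduling styles can be illustrated with a toy step counter. This is a simplified sketch under stated assumptions (one token per request per decode step, no prefill cost, hypothetical output lengths), not a model of vLLM's scheduler and not a reproduction of the figures above.

```python
import heapq

def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    slots = []  # min-heap of the step at which each occupied slot becomes free
    for n in output_lens:
        # Use an empty slot if one exists, otherwise wait for the earliest one to free up.
        start = 0 if len(slots) < batch_size else heapq.heappop(slots)
        heapq.heappush(slots, start + n)
    return max(slots) if slots else 0

# Hypothetical mix of short and long generations.
lens = [10, 100, 10, 100, 10, 100, 10, 100]
static_steps = static_batching_steps(lens, batch_size=4)
continuous_steps = continuous_batching_steps(lens, batch_size=4)
print(static_steps, continuous_steps)
```

In the static case, short requests sit idle until the longest request in their batch completes; in the continuous case, their slots are handed to waiting requests, so the same work finishes in fewer steps.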
Tuning parameters:
# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256
# Prefill/decode schedule (auto-balanced by default)
# No manual tuning needed
Reuse computed KV cache for common prompt prefixes.
Use cases:
Example savings:
Prompt: [System: 500 tokens] + [User: 100 tokens]
Without caching: Compute 600 tokens every request
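With caching: Compute the 500-token system prefix once, then only the 100 user tokens per request
= ~83% fewer prefill tokens per request (500 of 600 reused), which lowers TTFT accordingly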
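The 83% figure follows directly from the token counts in the example. A minimal sketch of that arithmetic, assuming prefill cost scales roughly with the number of uncached prompt tokens (the function name is illustrative):

```python
def prefill_fraction_saved(prefix_tokens: int, suffix_tokens: int) -> float:
    """Fraction of prompt tokens whose KV cache is reused on a prefix-cache hit."""
    return prefix_tokens / (prefix_tokens + suffix_tokens)

saved = prefill_fraction_saved(500, 100)
print(f"{saved:.0%} of prefill tokens reused")
```

Actual TTFT gains also depend on scheduling and queueing, so the measured improvement may be smaller than the token ratio suggests.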
Enable prefix caching:
vllm serve MODEL --enable-prefix-caching
Automatic prefix detection:
Cache hit rate monitoring:
curl http://localhost:9090/metrics | grep cache_hit
# vllm_cache_hit_rate: 0.75 (75% hit rate)
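For scripted monitoring, the same check can be done in Python. The sketch below assumes the endpoint returns the Prometheus text format; the helper names are hypothetical, and the metric name in the sample is taken from the output quoted above, so the real name may differ between vLLM versions.

```python
from urllib.request import urlopen

def fetch_metrics(url: str) -> str:
    """Download the metrics text from the endpoint."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")

def filter_metrics(text: str, needle: str) -> dict:
    """Return {name: value} for sample lines whose name contains needle."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if needle in name:
            result[name] = float(value)
    return result

# Sample input shaped like the output above; names and values are illustrative.
sample = (
    "# HELP vllm_cache_hit_rate Prefix cache hit rate\n"
    "vllm_cache_hit_rate 0.75\n"
    "other_metric 3\n"
)
hits = filter_metrics(sample, "cache_hit")
print(hits)
```

Against a running server, this would be called as `filter_metrics(fetch_metrics("http://localhost:9090/metrics"), "cache_hit")`, using whichever host and port the metrics endpoint is actually exposed on.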
Use a smaller "draft" model to propose several tokens, then have the larger target model verify them in a single forward pass.
Speed improvement:
Standard: Generate 1 token per forward pass
Speculative: Accept up to 3-5 tokens per target-model forward pass (depends on how often the draft agrees)
= up to 2-3x faster generation
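The propose-and-verify loop can be sketched with toy "models" that are plain Python functions. This is an illustrative greedy-decoding version under stated assumptions, not vLLM's implementation: `draft_next` and `target_next` are hypothetical stand-ins, and a real system verifies all proposed positions in one batched target pass rather than in a Python loop.

```python
def speculative_step(context, draft_next, target_next, k):
    """One propose-and-verify round; returns the tokens accepted this round."""
    # Draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Target checks each proposal; stop at the first disagreement.
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the target's token instead
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: target adds one bonus token
    return accepted

def generate(context, draft_next, target_next, k, max_new_tokens):
    """Run rounds until max_new_tokens are produced; returns (tokens, rounds)."""
    out, rounds = list(context), 0
    while len(out) - len(context) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(context) + max_new_tokens], rounds

# Toy models: the draft agrees with the target except at some positions.
target = lambda ctx: len(ctx) % 7
draft = lambda ctx: -1 if len(ctx) % 4 == 3 else len(ctx) % 7

tokens, rounds = generate([0], draft, target, k=3, max_new_tokens=12)
print(tokens, rounds)
```

The output is identical to what the target alone would produce; the gain is that 12 tokens needed only 4 verification rounds instead of 12 target passes. The speedup shrinks as the draft's agreement rate drops.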
How it works:
Setup with separate draft model:
vllm serve meta-llama/Llama-3-70B-Instruct \
--speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--num-speculative-tokens 5
Setup with n-gram draft (no separate model):
vllm serve MODEL \
--speculative-method ngram \
--num-speculative-tokens 3
When to use:
vLLM vs HuggingFace Transformers (Llama 3 8B, A100):
| Metric               | HF Transformers | vLLM  | Improvement |
|----------------------|-----------------|-------|-------------|
| Throughput (req/sec) | 12              | 280   | 23x         |
| TTFT (ms)            | 850             | 120   | 7x          |
| Tokens/sec           | 45              | 2,100 | 47x         |
| GPU Memory (GB)      | 28              | 16    | 1.75x less  |
vLLM vs TensorRT-LLM (Llama 2 70B, 4x A100):
| Metric               | TensorRT-LLM | vLLM         | Notes               |
|----------------------|--------------|--------------|---------------------|
| Throughput (req/sec) | 320          | 285          | TRT 12% faster      |
| Setup complexity     | High         | Low          | vLLM much easier    |
| NVIDIA-only          | Yes          | No           | vLLM multi-platform |
| Quantization support | FP8, INT8    | AWQ/GPTQ/FP8 | vLLM more options   |
Step 1: Measure baseline
# Optional: install a load-testing tool for end-to-end HTTP benchmarks
pip install locust
# Run baseline benchmark
vllm bench throughput \
--model MODEL \
--input-len 128 \
--output-len 256 \
--num-prompts 1000
# Record: throughput, TTFT, tokens/sec
Step 2: Tune memory utilization
# Try different values: 0.7, 0.85, 0.9, 0.95
vllm serve MODEL --gpu-memory-utilization 0.9
Higher values leave more memory for the KV cache, which allows larger batches and higher throughput, but increase the risk of OOM.
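A rough estimate shows how the utilization setting translates into KV cache capacity. This is a back-of-the-envelope sketch, not how vLLM profiles memory: it ignores activations and runtime overhead, and the model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 16 GiB weight size, and the 80 GiB GPU are assumptions chosen for illustration.

```python
GIB = 2**30

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_blocks(gpu_bytes, utilization, weight_bytes, per_token_bytes, block_size=16):
    """Approximate number of KV blocks that fit after weights are loaded."""
    budget = int(gpu_bytes * utilization) - weight_bytes
    return max(budget, 0) // (block_size * per_token_bytes)

# Assumed shape for an 8B Llama-style model in fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
for util in (0.7, 0.85, 0.9, 0.95):
    blocks = kv_blocks(80 * GIB, util, 16 * GIB, per_token)
    print(f"utilization={util}: ~{blocks} blocks, ~{blocks * 16} cached tokens")
```

Because the weights are a fixed cost, each extra point of utilization goes almost entirely to KV cache, which is why small changes in this setting can noticeably change batch capacity.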
Step 3: Tune concurrency
# Try values: 128, 256, 512, 1024
vllm serve MODEL --max-num-seqs 256
Higher values give the scheduler more batching opportunities, but may increase per-request latency.
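Steps 2 and 3 together amount to a small grid search. The sketch below only generates the commands for each combination of the candidate values listed above; launching the server and running the benchmark for each one is left out. The helper name is hypothetical, and `MODEL` is the same placeholder used throughout.

```python
from itertools import product

def serve_command(model, gpu_memory_utilization, max_num_seqs):
    """Build the argv list for one `vllm serve` configuration."""
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-num-seqs", str(max_num_seqs),
    ]

utilizations = [0.7, 0.85, 0.9, 0.95]
concurrencies = [128, 256, 512, 1024]

# One command per (utilization, concurrency) pair.
grid = [serve_command("MODEL", u, c) for u, c in product(utilizations, concurrencies)]
for cmd in grid:
    print(" ".join(cmd))
```

Building the command as a list makes it easy to pass to `subprocess.run` later and avoids shell quoting issues.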
Step 4: Enable optimizations
# --enable-prefix-caching: for repeated prompts
# --enable-chunked-prefill: for long prompts
vllm serve MODEL \
--enable-prefix-caching \
--enable-chunked-prefill \
--gpu-memory-utilization 0.9 \
--max-num-seqs 512
Step 5: Re-benchmark and compare
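One simple way to compare the two runs is to compute an improvement factor per metric. This is a minimal sketch with a hypothetical helper; the numbers are placeholders for illustration only, not measurements, and the metric keys are assumptions to be replaced with whatever was recorded in Step 1.

```python
def compare(baseline: dict, tuned: dict, lower_is_better=("ttft_ms",)) -> dict:
    """Return improvement factors per metric (>1 means the tuned run is better)."""
    factors = {}
    for key, base in baseline.items():
        new = tuned[key]
        # For latency-style metrics a smaller value is an improvement.
        factors[key] = base / new if key in lower_is_better else new / base
    return factors

# Placeholder values, not benchmark results.
baseline = {"req_per_sec": 100.0, "ttft_ms": 400.0, "tokens_per_sec": 1000.0}
tuned = {"req_per_sec": 150.0, "ttft_ms": 200.0, "tokens_per_sec": 1600.0}
factors = compare(baseline, tuned)
print(factors)
```

Keeping the same model, prompt lengths, and number of prompts across both runs is what makes these factors meaningful.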
Target improvements:
Common performance issues:
Low throughput (<50 req/sec):
- Increase --max-num-seqs
- Enable --enable-prefix-caching

High TTFT (>1 second):
- Enable --enable-chunked-prefill
- Reduce --max-model-len if possible

OOM errors:
- Reduce --gpu-memory-utilization to 0.7
- Reduce --max-model-len
- Use quantization (--quantization awq)