# vLLM

vLLM is a fast and easy-to-use library for running LLMs locally.
Create a virtual environment and install vLLM:

```shell
python3 -m venv ~/.venvs/aienv
source ~/.venvs/aienv/bin/activate
uv pip install vllm
```
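To confirm the install, print the installed version (a quick sanity check; `__version__` is standard package metadata):

```shell
python -c "import vllm; print(vllm.__version__)"
```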
Then serve a model; vLLM exposes an OpenAI-compatible API (on port 8000 by default). This example serves Qwen 2.5 7B Instruct with tool calling enabled:

```shell
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --dtype float16 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.9
```
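Any OpenAI-compatible client can then talk to the server. A minimal smoke test with the `openai` Python package; the dummy API key is an assumption that holds only when the server is started without `--api-key`:

```python
from openai import OpenAI

# Point the client at vLLM's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```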
## VLLMEmbedder

vLLM embedders can load and run embedding models locally, without requiring a server.
Install vLLM (if not already installed):

```shell
uv pip install vllm
```
Choose an embedding model. Recommended models:

- intfloat/e5-mistral-7b-instruct (4096 dimensions, 7B parameters)
- BAAI/bge-large-en-v1.5 (1024 dimensions, 335M parameters)
- sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, 22M parameters)

GPU requirements: larger models need more GPU memory; for CPU-only machines, see the performance tips at the end of this page. You can check a model's output dimension with the sketch below.
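To sanity-check a model's embedding dimension before wiring it into a knowledge base, you can run it through vLLM's offline embedding API. A minimal sketch; the `task="embed"` flag and `embed()` method match recent vLLM releases (older releases used `task="embedding"` and `encode()`), so treat the exact spelling as version-dependent:

```python
from vllm import LLM

# Load the model in embedding mode (no server involved).
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")

outputs = llm.embed(["Hello world"])
print(len(outputs[0].outputs.embedding))  # expect 1024 for bge-large-en-v1.5
```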
Usage:
```python
from agno.knowledge.embedder.vllm import VLLMEmbedder

# Local mode (no server needed)
embedder = VLLMEmbedder(
    id="intfloat/e5-mistral-7b-instruct",
    dimensions=4096,
)

# Get embeddings
embedding = embedder.get_embedding("Hello world")
print(f"Embedding dimension: {len(embedding)}")
```
Examples:

- cookbook/07_knowledge/embedders/vllm_embedder.py
- cookbook/07_knowledge/embedders/vllm_embedder_batching.py

Local mode (no server): `VLLMEmbedder(id="model-name")`; no `base_url` needed.
Remote mode (requires a running server): `VLLMEmbedder(base_url="http://localhost:8000/v1")`; a remote-mode sketch follows the batching example below.

Enable batching for multiple embeddings:
```python
embedder = VLLMEmbedder(
    id="intfloat/e5-mistral-7b-instruct",
    enable_batch=True,
    batch_size=32,  # adjust based on GPU memory
)
```
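For remote mode, point the embedder at an already-running server instead of loading the model in-process. A minimal sketch, assuming the server was started with `vllm serve intfloat/e5-mistral-7b-instruct` and that remote mode accepts the same `id` and `dimensions` arguments as local mode:

```python
from agno.knowledge.embedder.vllm import VLLMEmbedder

# Remote mode: requests go to the server's OpenAI-compatible
# endpoint rather than a locally loaded model.
embedder = VLLMEmbedder(
    id="intfloat/e5-mistral-7b-instruct",  # model the server is serving
    dimensions=4096,
    base_url="http://localhost:8000/v1",
)

embedding = embedder.get_embedding("Hello world")
print(len(embedding))
```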
Performance tips:

- Use smaller models for faster inference if precision isn't critical.
- For CPU-only machines, use smaller models (bge-small, MiniLM).
Run the basic example:

```shell
python cookbook/90_models/vllm/basic.py
```