Offline inference is possible in your own code using vLLM's [LLM][vllm.LLM] class.
vLLM models can be categorized into two types:

- **Generative Models**: Models that produce text completions or chat responses (e.g., LLaMA, Qwen, DeepSeek). Use `LLM.generate()` and `LLM.chat()` for these models.
- **Pooling Models**: Models that do not generate content. They are primarily used for classification and retrieval tasks (e.g., bge-m3, Qwen3 Reranker).
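
As a minimal sketch of the generative path, the helpers below wrap `LLM.generate()` and `LLM.chat()` and return only the generated text. The helper names (`build_conversation`, `completions`, `chat_reply`, `main`) are hypothetical and not part of vLLM. The model name is borrowed from the Ray Data example later on this page and is only an assumption; any chat-capable generative model should work.

```python
from typing import Any


def build_conversation(system_prompt: str, user_message: str) -> list[dict[str, str]]:
    """Build one chat conversation in the role/content message format."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]


def completions(llm: Any, prompts: list[str], sampling_params: Any = None) -> list[str]:
    """Return the first completion text for each prompt, using LLM.generate."""
    outputs = llm.generate(prompts, sampling_params=sampling_params)
    return [request_output.outputs[0].text for request_output in outputs]


def chat_reply(llm: Any, conversation: list[dict[str, str]], sampling_params: Any = None) -> str:
    """Return the first response text for a single conversation, using LLM.chat."""
    outputs = llm.chat(conversation, sampling_params=sampling_params)
    return outputs[0].outputs[0].text


def main() -> None:
    # Imported lazily so the helpers above can be exercised without vLLM installed.
    from vllm import LLM, SamplingParams

    # Assumption: a small instruction-tuned model; swap in any generative model.
    llm = LLM(model="unsloth/Llama-3.2-1B-Instruct")
    params = SamplingParams(temperature=0.3, max_tokens=64)

    for text in completions(llm, ["The capital of France is"], params):
        print(text)

    conversation = build_conversation(
        "You are a bot that completes unfinished haikus.",
        "An old silent pond...",
    )
    print(chat_reply(llm, conversation, params))
```

Calling `main()` requires vLLM and suitable hardware. The helpers accept any object that exposes `generate` and `chat`, so they can be checked with a stand-in object first.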
For further details on generative models, please refer to this page.

- `LLM.generate`: Generates completions for the given input prompts.
- `LLM.chat`: Generates responses for a chat conversation.
- `LLM.enqueue`: Enqueues prompts for generation without waiting for completion.
- `LLM.enqueue_chat`: Enqueues chat conversations for generation without waiting.
- `LLM.wait_for_completion`: Waits for all enqueued requests to complete and returns results.

For further details on pooling models, please refer to this page.

- `LLM.classify`: Only applicable to classification models.
- `LLM.embed`: Only applicable to embedding models.
- `LLM.score`: Applicable to score models (cross-encoder, bi-encoder, late-interaction).
- `LLM.encode`: Applicable to all pooling models.

For further details on profiling, please refer to this page.

- `LLM.start_profile`: Starts profiling with an optional custom trace prefix.
- `LLM.stop_profile`: Stops the ongoing profiling session.

For further details on sleep mode, please refer to this page.

- `LLM.sleep`: Puts the engine into sleep mode.
- `LLM.wake_up`: Wakes up the engine from sleep mode.
- `LLM.reset_mm_cache`: Resets the multi-modal cache.
- `LLM.reset_prefix_cache`: Resets the prefix cache.

For further details on metrics, please refer to this page.

- `LLM.get_metrics`: Returns a snapshot of aggregated metrics from Prometheus.

For further details on weight transfer, please refer to this page.

- `LLM.init_weight_transfer_engine`: Initializes the weight transfer engine for RL training.
- `LLM.start_weight_update`: Starts a new weight update cycle.
- `LLM.update_weights`: Updates the model weights.
- `LLM.finish_weight_update`: Finishes the current weight update cycle.
- `LLM.collective_rpc`: Executes a method or callable collectively across all workers.
- `LLM.apply_model`: Applies a function directly to the model inside each worker.

Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine. This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference.

??? code

    ```python
    import ray  # Requires ray>=2.44.1
    from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

    config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.2-1B-Instruct")
    processor = build_llm_processor(
        config,
        # Turn each input row into a chat request.
        preprocess=lambda row: {
            "messages": [
                {"role": "system", "content": "You are a bot that completes unfinished haikus."},
                {"role": "user", "content": row["item"]},
            ],
            "sampling_params": {"temperature": 0.3, "max_tokens": 250},
        },
        # Keep only the generated text in the output rows.
        postprocess=lambda row: {"answer": row["generated_text"]},
    )

    ds = ray.data.from_items(["An old silent pond..."])
    ds = processor(ds)
    ds.write_parquet("local:///tmp/data/")
    ```
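
The `preprocess` and `postprocess` callables in the example above are plain row-to-row functions, so one option is to write them as named functions and check them on ordinary dictionaries before launching a Ray job. The sketch below assumes the same row keys as the example (`item` on input, `generated_text` on output); the constant `SYSTEM_PROMPT` and the sample generated text are placeholders for illustration, not real model output.

```python
from typing import Any

SYSTEM_PROMPT = "You are a bot that completes unfinished haikus."


def preprocess(row: dict[str, Any]) -> dict[str, Any]:
    """Turn one input row into a chat request for the processor."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["item"]},
        ],
        "sampling_params": {"temperature": 0.3, "max_tokens": 250},
    }


def postprocess(row: dict[str, Any]) -> dict[str, Any]:
    """Keep only the generated text, renamed to `answer`."""
    return {"answer": row["generated_text"]}


# Quick check on plain dicts; no Ray cluster or GPU is needed for this part.
request = preprocess({"item": "An old silent pond..."})
print(request["messages"][1]["content"])

# Placeholder text standing in for a model response.
print(postprocess({"generated_text": "placeholder completion"}))
```

These functions can then be passed by name, as in `build_llm_processor(config, preprocess=preprocess, postprocess=postprocess)`, in place of the lambdas.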
For more information about the Ray Data LLM API, see the Ray Data LLM documentation.