Back to Vllm

Features

docs/features/README.md

0.20.15.1 KB
Original Source

Features

Compatibility Matrix

The tables below show mutually exclusive features and the support on some hardware.

The symbols used have the following meanings:

  • ✅ = Full compatibility
  • 🟠 = Partial compatibility
  • ❌ = No compatibility
  • ❔ = Unknown or TBD

!!! note Check the ❌ or 🟠 with links to see tracking issue for unsupported feature/hardware combination.

Feature x Feature

<style> td:not(:first-child) { text-align: center !important; } td { padding: 0.5rem !important; white-space: nowrap; } th { padding: 0.5rem !important; min-width: 0 !important; } th:not(:first-child) { writing-mode: vertical-lr; transform: rotate(180deg) } </style>
FeatureCPAPCLoRASDCUDA graphpooling<abbr title="Encoder-Decoder Models">enc-dec</abbr><abbr title="Logprobs">logP</abbr><abbr title="Prompt Logprobs">prmpt logP</abbr><abbr title="Async Output Processing">async output</abbr>multi-step<abbr title="Multimodal Inputs">mm</abbr>best-ofbeam-searchprompt-embeds
CP
APC
LoRA
SD
CUDA graph
pooling🟠*🟠*
<abbr title="Encoder-Decoder Models">enc-dec</abbr>
<abbr title="Logprobs">logP</abbr>
<abbr title="Prompt Logprobs">prmpt logP</abbr>
<abbr title="Async Output Processing">async output</abbr>
multi-step
mm🟠<sup>^</sup>
best-of
beam-search
prompt-embeds

* Chunked prefill and prefix caching are only applicable to last-token or all pooling with causal attention.
<sup>^</sup> LoRA is only applicable to the language backbone of multimodal models.

Feature x Hardware

FeatureVoltaTuringAmpereAdaHopperCPUAMDIntel GPU
CP
APC
LoRA
SD
CUDA graph
pooling
<abbr title="Encoder-Decoder Models">enc-dec</abbr>
mm
prompt-embeds
<abbr title="Logprobs">logP</abbr>
<abbr title="Prompt Logprobs">prmpt logP</abbr>
<abbr title="Async Output Processing">async output</abbr>
multi-step
best-of
beam-search

!!! note For information on feature support on Google TPU, please refer to the TPU-Inference Recommended Models and Features documentation.