docs/features/speculative_decoding/speculators.md
Speculators is a library for accelerating LLM inference through speculative decoding, providing efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.
Speculators provides the following key features:
Large language models generate text one token at a time, which creates a fundamental bottleneck: each token requires a full forward pass through the model, leaving GPU compute underutilized while waiting for memory-bound operations. Speculative decoding addresses this by using a smaller, faster "draft" model (often times, just a single transformer layer) to predict multiple tokens ahead, and then verifying tokens in parallel with the primary model.
Speculative decoding provides the following benefits:
Speculators is particularly valuable for latency-sensitive applications where users are waiting for responses in real-time, such as conversational AI, interactive coding assistants, and streaming text generation.