Back to Localai

vLLM streaming + tool-parser benchmark

examples/vllm-bench/README.md

4.5.02.4 KB
Original Source

vLLM streaming + tool-parser benchmark

A small, self-contained Python script (stdlib only) that measures time-to-first-token (TTFT) for the vLLM backend's streaming path with a tool parser configured.

Why this exists

When a vLLM tool parser is active and a streaming chat completion is requested, LocalAI used to buffer the full generation to prevent raw tool-call markup (e.g. <tool_call>...) from leaking as delta.content. That was correct for tool-call responses, but it turned plain-text responses into effectively non-streaming — the client received nothing until the model finished.

With native parser-side streaming (parser.extract_tool_calls_streaming, implemented by every concrete vLLM 0.23+ tool parser), each delta can be classified per-token: emit as content, emit as a structured tool_call, or suppress.

Three scenarios

ScenarioRequestExpected outcome
tool_call"What is the weather in Paris? Please use the tool."Model calls get_weather. delta.tool_calls chunks; no content leak.
plain_text_short"Explain in 3 short sentences what a hash table is. Do NOT call any tool."Model writes ~3 sentences.
plain_text_long"Write a thorough 8-paragraph explanation of how Python's GIL works…"Model writes ~1500 tokens of prose.

The long scenario is where the streaming/buffering difference is most dramatic: with the buffer-all path, the client sees nothing for 20+ seconds and then everything at once; with native streaming, the first token arrives in <100ms and the response flows progressively.

What the script reports

For each scenario, across N runs:

  • ttf_content_s — time until the first delta.content chunk
  • ttf_tool_s — time until the first delta.tool_calls chunk
  • n_content_chunks — total content deltas (1 = bundled, >>1 = streamed)
  • n_tool_chunks — total tool_call deltas
  • total_s — total wall-clock until [DONE]
  • finish_reasontool_calls / stop / length

The big tell is n_content_chunks vs total_s ratio:

  • Buffer-all: n_content_chunks ≈ 1, ttf_content_stotal_s (one chunk at end)
  • Streaming: n_content_chunks ≈ token count, ttf_content_s ≈ first-token latency

Usage

bash
python ttft_streaming_tool_parser.py --url http://localhost:8080 --model my-coder --runs 3

JSON results are written to ttft_bench_<label>.json (default label: run).