examples/vllm-bench/README.md
A small, self-contained Python script (stdlib only) that measures time-to-first-token (TTFT) for the vLLM backend's streaming path with a tool parser configured.
When a vLLM tool parser is active and a streaming chat completion is requested,
LocalAI used to buffer the full generation to prevent raw tool-call markup
(e.g. <tool_call>...) from leaking as delta.content. That was correct
for tool-call responses, but it turned plain-text responses into effectively
non-streaming — the client received nothing until the model finished.
With native parser-side streaming (parser.extract_tool_calls_streaming,
implemented by every concrete vLLM 0.23+ tool parser), each delta can be
classified per-token: emit as content, emit as a structured tool_call, or
suppress.
| Scenario | Request | Expected outcome |
|---|---|---|
tool_call | "What is the weather in Paris? Please use the tool." | Model calls get_weather. delta.tool_calls chunks; no content leak. |
plain_text_short | "Explain in 3 short sentences what a hash table is. Do NOT call any tool." | Model writes ~3 sentences. |
plain_text_long | "Write a thorough 8-paragraph explanation of how Python's GIL works…" | Model writes ~1500 tokens of prose. |
The long scenario is where the streaming/buffering difference is most dramatic: with the buffer-all path, the client sees nothing for 20+ seconds and then everything at once; with native streaming, the first token arrives in <100ms and the response flows progressively.
For each scenario, across N runs:
ttf_content_s — time until the first delta.content chunkttf_tool_s — time until the first delta.tool_calls chunkn_content_chunks — total content deltas (1 = bundled, >>1 = streamed)n_tool_chunks — total tool_call deltastotal_s — total wall-clock until [DONE]finish_reason — tool_calls / stop / lengthThe big tell is n_content_chunks vs total_s ratio:
n_content_chunks ≈ 1, ttf_content_s ≈ total_s (one chunk at end)n_content_chunks ≈ token count, ttf_content_s ≈ first-token latencypython ttft_streaming_tool_parser.py --url http://localhost:8080 --model my-coder --runs 3
JSON results are written to ttft_bench_<label>.json (default label: run).