packages/kilo-docs/pages/gateway/streaming.md
The Kilo AI Gateway supports streaming responses from all models using Server-Sent Events (SSE). Streaming allows your application to display tokens as they're generated, providing a more responsive user experience.
Set stream: true in your request body to enable streaming:
{
"model": "anthropic/claude-sonnet-4.5",
"messages": [{ "role": "user", "content": "Write a short story" }],
"stream": true
}
{% callout type="info" %}
The gateway automatically injects stream_options.include_usage = true on all streaming requests, so you always receive token usage information in the final chunk.
{% /callout %}
The Vercel AI SDK handles SSE parsing and provides a clean streaming interface:
import { streamText } from "ai"
import { createOpenAI } from "@ai-sdk/openai"
const kilo = createOpenAI({
baseURL: "https://api.kilo.ai/api/gateway",
apiKey: process.env.KILO_API_KEY,
})
const result = streamText({
model: kilo.chat("anthropic/claude-sonnet-4.5"),
prompt: "Write a short story about a robot.",
})
for await (const textPart of result.textStream) {
process.stdout.write(textPart)
}
// Access usage data after streaming completes
const usage = await result.usage
console.log("Tokens used:", usage)
{% tabs %} {% tab label="TypeScript" %}
import OpenAI from "openai"
const client = new OpenAI({
apiKey: process.env.KILO_API_KEY,
baseURL: "https://api.kilo.ai/api/gateway",
})
const stream = await client.chat.completions.create({
model: "anthropic/claude-sonnet-4.5",
messages: [{ role: "user", content: "Write a short story" }],
stream: true,
})
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content
if (content) {
process.stdout.write(content)
}
}
{% /tab %} {% tab label="Python" %}
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("KILO_API_KEY"),
base_url="https://api.kilo.ai/api/gateway",
)
stream = client.chat.completions.create(
model="anthropic/claude-sonnet-4.5",
messages=[{"role": "user", "content": "Write a short story"}],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)
{% /tab %} {% /tabs %}
When streaming, the gateway returns data in SSE format. Each event is a JSON object prefixed with data: :
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"anthropic/claude-sonnet-4.5","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"anthropic/claude-sonnet-4.5","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"anthropic/claude-sonnet-4.5","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}
data: [DONE]
Token usage data is included in the final chunk before [DONE], with an empty choices array:
{
"id": "chatcmpl-abc123",
"object": "chat.completion.chunk",
"usage": {
"prompt_tokens": 12,
"completion_tokens": 150,
"total_tokens": 162
},
"choices": []
}
You can cancel a streaming request by aborting the connection. This stops token generation and billing for ungenerated tokens:
const controller = new AbortController()
const response = await fetch("https://api.kilo.ai/api/gateway/chat/completions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.KILO_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "anthropic/claude-sonnet-4.5",
messages: [{ role: "user", content: "Write a long essay" }],
stream: true,
}),
signal: controller.signal,
})
// Cancel after 5 seconds
setTimeout(() => controller.abort(), 5000)
{% callout type="warning" %} Stream cancellation behavior depends on the upstream provider. Some providers stop processing immediately, while others may continue processing after disconnection. The gateway handles partial usage tracking for cancelled streams. {% /callout %}
If an error occurs before any tokens are sent, the gateway returns a standard JSON error response with the appropriate HTTP status code:
{
"error": {
"message": "Insufficient balance",
"code": 402
}
}
If an error occurs after tokens have already been sent, the HTTP status (200) cannot be changed. The error appears as an SSE event:
data: {"error":{"message":"Provider disconnected","code":502},"choices":[{"index":0,"delta":{"content":""},"finish_reason":"error"}]}
Check for finish_reason: "error" to detect mid-stream errors in your client code.
For parsing SSE streams, we recommend these libraries: