docs_new/docs/basic_usage/anthropic_api.mdx
SGLang ships an Anthropic-compatible /v1/messages endpoint so any client built for the Anthropic
Messages API — including the Anthropic SDKs and agentic CLIs such as Claude Code — can talk to a
self-hosted SGLang server without changes. A complete reference for the API is available in the
Anthropic API Reference.
The endpoint is registered automatically on every SGLang server; no extra flag is required to enable it.
It reuses the same model, chat template, and reasoning / tool-call parsers as the OpenAI-compatible
endpoint, and supports non-streaming and streaming responses, tool use, and a count_tokens route.
This tutorial covers:
- POST /v1/messages (non-streaming and streaming)
- POST /v1/messages/count_tokens
- The CLAUDE_CODE_ATTRIBUTION_HEADER setting that is required for good prefix-cache reuse.

Launch the server in your terminal and wait for it to initialize. The Anthropic /v1/messages endpoint
is registered automatically; no extra flag is required beyond the usual server launch. The example below
is a single-node GLM-5.2-FP8 config; see the GLM-5.2 cookbook for verified commands across hardware and
quantizations.
sglang serve \
  --model-path zai-org/GLM-5.2-FP8 \
  --tp 8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --host 0.0.0.0 \
  --port 30000
Use the Anthropic Python SDK pointed at the server. Unlike the OpenAI SDK, the Anthropic SDK appends
/v1/messages itself, so base_url is the server root without a /v1 suffix.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://127.0.0.1:30000",
    api_key="EMPTY",  # SGLang does not require a real key by default
)

message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
)

# A reasoning model may emit a `thinking` block before the `text` block,
# so pick the text block rather than assuming content[0].
print(next(b.text for b in message.content if b.type == "text"))
Example Output:
Here are 3 countries and their capitals:
1. **France** - Paris
2. **Japan** - Tokyo
3. **Brazil** - Brasília
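
The next(...) expression above raises StopIteration when a reply has no text block at all, for example when the model answers with only a tool_use block. A minimal sketch of a more forgiving helper follows. It works on plain dicts (such as those from message.model_dump()["content"]) rather than SDK objects, and both the helper name collect_text and the sample blocks are illustrative, not part of SGLang or the SDK.

```python
from typing import Any, Dict, List


def collect_text(content: List[Dict[str, Any]]) -> str:
    """Join every text block; return an empty string when there is none."""
    return "".join(b["text"] for b in content if b.get("type") == "text")


# Hypothetical reply: a thinking block followed by a text block.
sample_content = [
    {"type": "thinking", "thinking": "The user wants three pairs."},
    {"type": "text", "text": "1. France - Paris"},
]
print(collect_text(sample_content))

# A reply that only contains a tool call yields an empty string, not an error.
tool_only = [{"type": "tool_use", "id": "toolu_x", "name": "f", "input": {}}]
print(repr(collect_text(tool_only)))
```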
Use client.messages.stream(...) to receive Server-Sent Events as they are produced; passing stream=True
to client.messages.create(...) yields the raw events instead.
with client.messages.stream(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Say this is a test"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Example Output:
This is a test.
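
For clients that read the stream without the SDK, the text arrives as content_block_delta events carrying text_delta payloads. The sketch below parses a hand-written sample in the Anthropic streaming event format. The sample was not captured from an SGLang server, and iter_text_deltas is an illustrative helper name.

```python
import json
from typing import Iterator

# Hand-written sample events, not real server output.
SAMPLE_SSE = "\n".join(
    [
        "event: ping",
        'data: {"type": "ping"}',
        "",
        "event: content_block_delta",
        'data: {"type": "content_block_delta", "index": 0, '
        '"delta": {"type": "text_delta", "text": "This is"}}',
        "",
        "event: content_block_delta",
        'data: {"type": "content_block_delta", "index": 0, '
        '"delta": {"type": "text_delta", "text": " a test."}}',
        "",
        "event: message_stop",
        'data: {"type": "message_stop"}',
        "",
    ]
)


def iter_text_deltas(sse_body: str) -> Iterator[str]:
    """Yield the text carried by each text_delta event in an SSE body."""
    for line in sse_body.splitlines():
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        payload = json.loads(line[len("data:"):].strip())
        if payload.get("type") != "content_block_delta":
            continue
        delta = payload.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]


streamed_text = "".join(iter_text_deltas(SAMPLE_SSE))
print(streamed_text)
```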
The top-level system field is accepted as a string or as a list of text blocks, matching the Anthropic
API shape:
message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    system="You are a helpful assistant that answers concisely.",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(next(b.text for b in message.content if b.type == "text"))
Example Output:
The capital of France is Paris.
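
The list-of-text-blocks form of system is not shown above, so here is a small sketch of both shapes side by side. The system_to_text helper is illustrative only: it flattens either shape into one string, and joining blocks with a newline is an assumption for the example, not a statement about how SGLang combines them internally.

```python
from typing import Dict, List, Union

SystemField = Union[str, List[Dict[str, str]]]


def system_to_text(system: SystemField) -> str:
    """Flatten either accepted shape of the system field into one string."""
    if isinstance(system, str):
        return system
    # Assumption for this sketch: text blocks are joined with a newline.
    return "\n".join(b["text"] for b in system if b.get("type") == "text")


system_as_string = "You are a helpful assistant that answers concisely."
system_as_blocks = [
    {"type": "text", "text": "You are a helpful assistant"},
    {"type": "text", "text": "that answers concisely."},
]

print(system_to_text(system_as_string))
print(system_to_text(system_as_blocks))
```

Either value can be passed as the system argument of client.messages.create.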
Tool definitions follow the Anthropic tools schema. When the server is launched with a
--tool-call-parser, the model's tool calls are returned as tool_use content blocks:
message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    tools=[
        {
            "name": "get_weather",
            "description": "Get the weather for a city",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
)
print(message.stop_reason)
print([b for b in message.content if b.type == "tool_use"])
Example Output:
tool_use
[ToolUseBlock(type='tool_use', id='toolu_01XXXX', name='get_weather', input={'city': 'Paris'})]
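
After a tool_use reply, the Anthropic Messages format expects the client to run the tool and send the output back as a tool_result block in a user message. The sketch below builds that follow-up message from plain dicts. The get_weather implementation, the TOOLS registry, and the tool_result_message helper are stand-ins invented for the example.

```python
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would query a weather service.
    return f"Sunny in {city}"


# Hypothetical registry mapping tool names to local callables.
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def tool_result_message(content: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Run every tool_use block and wrap the outputs in one user message."""
    results = []
    for block in content:
        if block.get("type") != "tool_use":
            continue
        output = TOOLS[block["name"]](**block["input"])
        results.append(
            {"type": "tool_result", "tool_use_id": block["id"], "content": output}
        )
    return {"role": "user", "content": results}


assistant_content = [
    {
        "type": "tool_use",
        "id": "toolu_01XXXX",
        "name": "get_weather",
        "input": {"city": "Paris"},
    }
]
follow_up = tool_result_message(assistant_content)
print(follow_up)
```

To continue the conversation, append the assistant turn and then follow_up to messages and call client.messages.create again with the same tools.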
POST /v1/messages/count_tokens returns the tokenized length of a request without generating a
response. It reuses the same request conversion as /v1/messages, so system prompts, tools, and
multi-turn history are all accounted for.
resp = client.messages.count_tokens(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "Hello, world"}],
)
print(resp.input_tokens)
Example Output:
15
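
The same route can be called without the SDK. The sketch below only builds the HTTP request with the standard library and does not send it, so it runs without a server. It assumes a server launched without --api-key, which is why no auth header is set, and build_count_tokens_request is an illustrative helper name.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_count_tokens_request(
    base_url: str,
    model: str,
    messages: List[Dict[str, Any]],
    system: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/messages/count_tokens."""
    body: Dict[str, Any] = {"model": model, "messages": messages}
    if system is not None:
        body["system"] = system
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages/count_tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_count_tokens_request(
    "http://127.0.0.1:30000",
    "zai-org/GLM-5.2-FP8",
    [{"role": "user", "content": "Hello, world"}],
)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["input_tokens"])
```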
Claude Code can be pointed at an SGLang server by setting a few env vars in the shell that starts it.
With the server already running on :30000, export the full set and launch claude:
export ANTHROPIC_BASE_URL="http://127.0.0.1:30000"
export ANTHROPIC_AUTH_TOKEN="dummy" # required by Claude Code; any non-empty string works
export API_TIMEOUT_MS="3000000" # long timeout — reasoning + 1M-context turns are slow
export CLAUDE_CODE_AUTO_COMPACT_WINDOW="1000000" # let auto-compact use the full 1M window
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 # drop autoupdater/telemetry/error-reporting noise
export CLAUDE_CODE_ATTRIBUTION_HEADER=0 # required for prefix-cache reuse — see below
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-5.2[1m]" # [1m] suffix enables Claude Code's 1M-context beta
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]" # [1m] suffix enables Claude Code's 1M-context beta
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" # [1m] suffix enables Claude Code's 1M-context beta
claude
Each var matters:
- ANTHROPIC_BASE_URL: points Claude Code at your SGLang server instead of the Anthropic API.
- ANTHROPIC_AUTH_TOKEN: Claude Code requires a non-empty auth token; SGLang accepts any value
  when launched without --api-key.
- API_TIMEOUT_MS: raise it; reasoning models with long outputs and 1M-context turns routinely
  exceed the default timeout.
- ANTHROPIC_DEFAULT_{HAIKU,SONNET,OPUS}_MODEL: the model name Claude Code sends for each tier.
  SGLang does not validate this field, so any name works. Use glm-5.2[1m]: the [1m] suffix is a
  client-side hint that enables Claude Code's 1M-context beta (without it, context is capped).
- CLAUDE_CODE_AUTO_COMPACT_WINDOW: set to 1000000 so auto-compaction uses the full 1M window
  instead of the default, keeping long sessions alive.

The same kind of configuration can be written as a JSON env block:
{
"env": {
"ANTHROPIC_BASE_URL": "http://127.0.0.1:30000",
"ANTHROPIC_AUTH_TOKEN": "dummy",
"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-5.2[1m]",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]"
}
}
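
To keep the shell exports and the JSON env block from drifting apart, both can be generated from one mapping. This is a minimal sketch: the claude_env name is illustrative, and the values are copied from the examples above.

```python
import json

# Single source of truth for the Claude Code environment (values from above).
claude_env = {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:30000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]",
}

# JSON form: every value must be a string.
settings_json = json.dumps({"env": claude_env}, indent=2)
print(settings_json)

# Shell form: one export line per variable.
export_lines = [f'export {name}="{value}"' for name, value in claude_env.items()]
print("\n".join(export_lines))
```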
CLAUDE_CODE_ATTRIBUTION_HEADER=0 for prefix-cache reuse

Claude Code prepends a per-request attribution block to the start of the system prompt, of the form
x-anthropic-billing-header: cc_version=<ver>.<per-request-hash>; cc_entrypoint=...; cch=<hash>;. The
per-request hash is the first token to differ between turns, so the radix prefix cache can only reuse
the short prefix before that hash and re-prefills the system prompt plus the entire conversation history
on every turn.
Setting CLAUDE_CODE_ATTRIBUTION_HEADER=0 removes the whole attribution line from the system prompt.
This is a documented Claude Code env var whose explicit purpose is to "improve prompt-cache hit rates when
routing through an LLM gateway" (see the Claude Code env-vars reference).
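
The effect on the prefix cache can be illustrated with a toy model. In the sketch below, whitespace-separated words stand in for real tokens, the attribution strings are made-up values in the shape quoted above, and shared_prefix_len approximates what a prefix cache could reuse between two consecutive turns. It is not SGLang's radix cache implementation.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count leading tokens that are identical in both prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(attribution: str, history: List[str]) -> List[str]:
    # Toy tokenizer: split on whitespace. Real tokenizers differ.
    system = "You are a coding agent . Follow the repo conventions ."
    return " ".join([attribution, system] + history).split()


turn1 = ["user: list files"]
turn2 = turn1 + ["assistant: a.py b.py", "user: open a.py"]

# Attribution header present: the per-request hash changes between turns.
with_header = shared_prefix_len(
    build_prompt("x-anthropic-billing-header: cc_version=1.0.aaa111;", turn1),
    build_prompt("x-anthropic-billing-header: cc_version=1.0.bbb222;", turn2),
)

# Attribution header removed: turn 2 extends turn 1 exactly.
without_header = shared_prefix_len(build_prompt("", turn1), build_prompt("", turn2))

print("reusable tokens with header:   ", with_header)
print("reusable tokens without header:", without_header)
```

Without the header, the whole of turn 1 is a prefix of turn 2, so only the new messages need prefilling.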
Connection refused / fetch failed — Ensure the server is up and the port in ANTHROPIC_BASE_URL
matches --port (default 30000). If you set ANTHROPIC_BASE_URL to a remote host, confirm it's reachable
and not behind a proxy that blocks the connection.
Model not found / 404 from the server — SGLang does not validate the request model field and
serves whatever model was loaded at startup, so a 404 usually means the request did not reach the
/v1/messages route at all. Confirm ANTHROPIC_BASE_URL points at the server (not missing the port) and
that the server finished loading.
Tool calls not working / returned as raw text — Launch the server with the correct
--tool-call-parser for your model (e.g. glm47, qwen3). Without it the tools field is still accepted
but the model's tool calls come back as text instead of tool_use blocks, and Claude Code cannot execute
them.
Slow / re-prefills the whole history every turn — You are missing
CLAUDE_CODE_ATTRIBUTION_HEADER=0. Claude Code's per-request attribution hash in the system prompt
defeats radix prefix-cache reuse; see the section above.
Context capped below 1M — The model name must end in [1m] for Claude Code to enable its 1M-context
beta. Verify ANTHROPIC_DEFAULT_*_MODEL uses the [1m] suffix, and that the loaded model's native context
is 1M (GLM-5.2 is 1048576; pass --context-length only to cap it, not to extend).
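
For the connection and 404 cases above, a quick first check is whether anything is listening at the host and port that ANTHROPIC_BASE_URL resolves to. The following sketch uses only the standard library; server_reachable is an illustrative helper, and a successful TCP connection does not prove the model has finished loading.

```python
import socket
from urllib.parse import urlparse


def resolve_host_port(base_url: str) -> tuple:
    """Return the (host, port) a client would connect to for base_url."""
    parsed = urlparse(base_url)
    host = parsed.hostname or "127.0.0.1"
    # A URL without an explicit port falls back to the scheme default,
    # which is how a missing ":30000" ends up hitting port 80.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return host, port


def server_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    try:
        with socket.create_connection(resolve_host_port(base_url), timeout=timeout):
            return True
    except OSError:
        return False


print(resolve_host_port("http://127.0.0.1:30000"))
print(resolve_host_port("http://127.0.0.1"))  # port omitted by mistake
```

Calling server_reachable("http://127.0.0.1:30000") on the machine running Claude Code shows whether the port is open before digging into other settings.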
The /v1/messages endpoint accepts the standard Anthropic Messages API parameters. Refer to the
Anthropic Messages API reference for the full list.
Reasoning models are supported through the same --reasoning-parser mechanism as the OpenAI-compatible
endpoint; pass the model's reasoning kwarg via the request (e.g. thinking for DeepSeek-V3-style models,
enable_thinking for Qwen3-style models). See OpenAI APIs - Completions for
the reasoning-parser / chat-template mapping.