Anthropic-Compatible API


SGLang ships an Anthropic-compatible /v1/messages endpoint so any client built for the Anthropic Messages API — including the Anthropic SDKs and agentic CLIs such as Claude Code — can talk to a self-hosted SGLang server without changes. A complete reference for the API is available in the Anthropic API Reference.

The endpoint is registered automatically on every SGLang server; no extra flag is required to enable it. It reuses the same model, chat template, and reasoning / tool-call parsers as the OpenAI-compatible endpoint, and supports both non-streaming and streaming responses, tool use, and a count_tokens route.

This tutorial covers:

  • POST /v1/messages (non-streaming and streaming)
  • POST /v1/messages/count_tokens
  • Pointing Claude Code at the server, including the CLAUDE_CODE_ATTRIBUTION_HEADER setting that is required for good prefix-cache reuse.

Launch A Server

Launch the server in your terminal and wait for it to initialize. The /v1/messages endpoint needs no flag beyond the usual launch arguments. The example below is a single-node GLM-5.2-FP8 config; see the GLM-5.2 cookbook for verified commands across hardware and quantizations.

bash
sglang serve \
    --model-path zai-org/GLM-5.2-FP8 \
    --tp 8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --host 0.0.0.0 \
    --port 30000
<Note>

- **The endpoint is model-agnostic.** The `/v1/messages` route is on by default for any model; GLM-5.2 is used here because its reasoning + tool-use output is where Claude Code integration shines, but any model works.
- **Model name and `[1m]`.** SGLang does not validate the request `model` field, so Claude Code can send any name. The `[1m]` suffix is a **client-side hint**: Claude Code only enables its 1M-context beta when the model name ends in `[1m]`; without it, context is capped. Set the same `glm-5.2[1m]` in the `ANTHROPIC_DEFAULT_*_MODEL` env vars below.
- **`--reasoning-parser` / `--tool-call-parser` are optional.** Add them when the model emits reasoning content (GLM-5.2, Qwen3, DeepSeek-R1, …) or when you want tool calls parsed into structured `tool_use` blocks. Without a tool-call parser, tool schemas are still accepted but the model's tool calls come back as raw text, and Claude Code cannot execute them.
- **Context length** defaults to the model's own (1M for GLM-5.2); pass `--context-length` only to cap it.

</Note>

Send A Message

Non-Streaming

Use the Anthropic Python SDK pointed at the server. Unlike the OpenAI SDK, the Anthropic SDK appends /v1/messages itself, so base_url is the server root without a /v1 suffix.

python
from anthropic import Anthropic

client = Anthropic(
    base_url="http://127.0.0.1:30000",
    api_key="EMPTY",  # SGLang does not require a real key by default
)

message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
)
# A reasoning model may emit a `thinking` block before the `text` block —
# pick the text block rather than assuming content[0].
print(next(b.text for b in message.content if b.type == "text"))

Example Output:

text
Here are 3 countries and their capitals:

1. **France** - Paris
2. **Japan** - Tokyo
3. **Brazil** - Brasília
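
For clients that cannot use the SDK, the same call can be sketched as a plain HTTP request. This is a minimal illustration using only the standard library: the helper names `build_messages_request` and `first_text` are hypothetical, and the `x-api-key` / `anthropic-version` headers simply mirror what Anthropic clients normally send (whether SGLang inspects them is an assumption). The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_messages_request(base_url, model, user_text, max_tokens=512, api_key="EMPTY"):
    """Build (but do not send) a POST /v1/messages request."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        # base_url is the server root; the /v1/messages path is added here.
        url=base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "content-type": "application/json",
            "x-api-key": api_key,               # assumed: any value when no --api-key is set
            "anthropic-version": "2023-06-01",  # assumed: sent for client parity
        },
    )


def first_text(response_json):
    """Return the first `text` block, skipping any `thinking` blocks."""
    for block in response_json.get("content", []):
        if block.get("type") == "text":
            return block["text"]
    return None


req = build_messages_request(
    "http://127.0.0.1:30000",
    "zai-org/GLM-5.2-FP8",
    "List 3 countries and their capitals.",
)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(first_text(json.load(resp)))
```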

Streaming

Use the SDK's messages.stream() helper (or pass stream=True to messages.create()) to receive Server-Sent Events as they are produced.

python
with client.messages.stream(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    messages=[{"role": "user", "content": "Say this is a test"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Example Output:

text
This is a test.
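
Under the hood the stream is a sequence of Server-Sent Events. The sketch below parses a hand-written sample in the Anthropic event format and concatenates the `text_delta` fragments; the sample payload is illustrative rather than captured from SGLang, and it assumes SGLang's events follow the Anthropic shape. `iter_text_deltas` is a hypothetical helper name.

```python
import json

# Hand-written sample in the Anthropic streaming format (not real server output).
SAMPLE_SSE = """\
event: message_start
data: {"type": "message_start", "message": {"role": "assistant", "content": []}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "This is"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " a test."}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_stop
data: {"type": "message_stop"}
"""


def iter_text_deltas(lines):
    """Yield the text fragments carried by content_block_delta events."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip `event:` lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]


streamed = "".join(iter_text_deltas(SAMPLE_SSE.splitlines()))
print(streamed)  # This is a test.
```

The SDK's `text_stream` iterator does this filtering for you; parsing by hand is mainly useful when debugging a proxy or a non-SDK client.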

System Prompt

The top-level system field is accepted as a string or as a list of text blocks, matching the Anthropic API shape:

python
message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    system="You are a helpful assistant that answers concisely.",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(next(b.text for b in message.content if b.type == "text"))

Example Output:

text
The capital of France is Paris.
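
The list-of-blocks form of `system` can be sketched as follows. The request body uses the Anthropic block shape (`{"type": "text", "text": ...}`); the `system_to_text` helper is hypothetical and only illustrates one plausible way the blocks could be flattened. How SGLang actually joins them is not specified here, so the newline separator is an assumption.

```python
def system_to_text(system):
    """Flatten a `system` value (str or list of text blocks) into one string.

    Joining blocks with a newline is an assumption made for illustration.
    """
    if isinstance(system, str):
        return system
    return "\n".join(block["text"] for block in system if block.get("type") == "text")


system_blocks = [
    {"type": "text", "text": "You are a helpful assistant."},
    {"type": "text", "text": "Answer concisely."},
]

# Same request as the string form, with `system` given as a list of text blocks.
request_body = {
    "model": "zai-org/GLM-5.2-FP8",
    "max_tokens": 512,
    "system": system_blocks,
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}

print(system_to_text(request_body["system"]))
```

With the SDK, the same list can be passed directly as `system=system_blocks` to `client.messages.create(...)`.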

Tool Use

Tool definitions follow the Anthropic tools schema. When the server is launched with a --tool-call-parser, the model's tool calls are returned as tool_use content blocks:

python
message = client.messages.create(
    model="zai-org/GLM-5.2-FP8",
    max_tokens=512,
    tools=[
        {
            "name": "get_weather",
            "description": "Get the weather for a city",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
)
print(message.stop_reason)
print([b for b in message.content if b.type == "tool_use"])

Example Output:

text
tool_use
[ToolUseBlock(type='tool_use', id='toolu_01XXXX', name='get_weather', input={'city': 'Paris'})]
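
After a `tool_use` block comes back, the client normally runs the tool and sends the result in a follow-up turn. The sketch below builds that follow-up message list in the Anthropic shape (an assistant turn echoing the `tool_use` block, then a user turn carrying a `tool_result` with the matching `tool_use_id`). `append_tool_result` and the local `get_weather` function are hypothetical, and the block values are copied from the example output above rather than from a live server.

```python
def append_tool_result(messages, assistant_content, tool_use_id, result_text):
    """Return a new message list with the tool_use turn and its result appended."""
    return messages + [
        {"role": "assistant", "content": assistant_content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use_id,  # must match the tool_use block's id
                    "content": result_text,
                }
            ],
        },
    ]


def get_weather(city):
    """Hypothetical local tool implementation."""
    return f"Sunny, 22 C in {city}"


history = [{"role": "user", "content": "What is the weather in Paris?"}]
assistant_content = [
    {"type": "tool_use", "id": "toolu_01XXXX", "name": "get_weather", "input": {"city": "Paris"}}
]

tool_use = assistant_content[0]
follow_up = append_tool_result(
    history, assistant_content, tool_use["id"], get_weather(**tool_use["input"])
)
print(follow_up[-1]["content"][0]["content"])  # Sunny, 22 C in Paris
```

Passing `follow_up` as `messages` (with the same `tools` list) in a second `client.messages.create(...)` call should let the model answer using the tool output.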

Counting Tokens

POST /v1/messages/count_tokens returns the tokenized length of a request without generating a response. It reuses the same request conversion as /v1/messages, so system prompts, tools, and multi-turn history are all accounted for.

python
resp = client.messages.count_tokens(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "Hello, world"}],
)
print(resp.input_tokens)

Example Output:

text
15
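
One common use of the count is sizing `max_tokens` so that prompt plus output stays inside the context window. The helper below is a hypothetical sketch of that arithmetic, not part of SGLang or the SDK; the 1048576-token window is the GLM-5.2 figure quoted in this document, and the token counts are example values.

```python
def pick_max_tokens(input_tokens, context_length, desired):
    """Clamp the desired output length to what still fits in the context window."""
    remaining = max(0, context_length - input_tokens)
    return min(desired, remaining)


CONTEXT_LENGTH = 1048576  # GLM-5.2 native context, per this document

# A short prompt leaves plenty of room, so the desired value is used as-is.
print(pick_max_tokens(15, CONTEXT_LENGTH, 512))

# A prompt that nearly fills the window leaves only a few tokens of output.
print(pick_max_tokens(1048570, CONTEXT_LENGTH, 512))
```

In practice `input_tokens` would come from `client.messages.count_tokens(...).input_tokens` for the exact request you are about to send.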

Using Claude Code

Claude Code can be pointed at an SGLang server by setting a few env vars in the shell that starts it. With the server already running on :30000, export the full set and launch claude:

bash
export ANTHROPIC_BASE_URL="http://127.0.0.1:30000"
export ANTHROPIC_AUTH_TOKEN="dummy"                 # required by Claude Code; any non-empty string works
export API_TIMEOUT_MS="3000000"                     # long timeout — reasoning + 1M-context turns are slow
export CLAUDE_CODE_AUTO_COMPACT_WINDOW="1000000"     # let auto-compact use the full 1M window
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1   # drop autoupdater/telemetry/error-reporting noise
export CLAUDE_CODE_ATTRIBUTION_HEADER=0             # required for prefix-cache reuse — see below
export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-5.2[1m]"    # [1m] suffix enables Claude Code's 1M-context beta
export ANTHROPIC_DEFAULT_SONNET_MODEL="glm-5.2[1m]"  # [1m] suffix enables Claude Code's 1M-context beta
export ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]"     # [1m] suffix enables Claude Code's 1M-context beta
claude

Each var matters:

  • ANTHROPIC_BASE_URL: points Claude Code at your SGLang server instead of the Anthropic API.
  • ANTHROPIC_AUTH_TOKEN: Claude Code requires a non-empty auth token; SGLang accepts any value when launched without --api-key.
  • API_TIMEOUT_MS: raise it; reasoning models with long outputs and 1M-context turns routinely exceed the default timeout.
  • ANTHROPIC_DEFAULT_{HAIKU,SONNET,OPUS}_MODEL: the model name Claude Code sends for each tier. SGLang does not validate this field, so any name works. Use glm-5.2[1m]: the [1m] suffix is a client-side hint that enables Claude Code's 1M-context beta (without it, context is capped).
  • CLAUDE_CODE_AUTO_COMPACT_WINDOW: set to 1000000 so auto-compaction uses the full 1M window instead of the default, keeping long sessions alive.
  • CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: set to 1 to drop autoupdater, telemetry, and error-reporting traffic. It does not affect the attribution header.
  • CLAUDE_CODE_ATTRIBUTION_HEADER: set to 0 to remove the per-request attribution line from the system prompt; required for prefix-cache reuse, as explained below.
<Tip> Instead of exporting these in every shell, persist them in `~/.claude/settings.json` under the `env` key — they apply to all Claude Code sessions:
json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:30000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]"
  }
}
</Tip>
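
If `~/.claude/settings.json` already holds other settings, the variables can be merged in rather than pasted over the file. The following is a minimal sketch of that merge on in-memory dicts; `merge_env` is a hypothetical helper, the keys in `existing` are placeholders, and reading or writing the actual file is left out so the snippet has no side effects.

```python
import json

# Values mirror the export block above.
SGLANG_ENV = {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:30000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]",
}


def merge_env(settings, env):
    """Return a copy of `settings` whose `env` key also contains `env`.

    Existing top-level keys and unrelated env entries are preserved;
    entries in `env` win on conflict. The input dict is not mutated.
    """
    merged = dict(settings)
    merged["env"] = {**settings.get("env", {}), **env}
    return merged


# Placeholder content standing in for an existing settings file.
existing = {"someOtherSetting": True, "env": {"EXISTING_VAR": "kept"}}
updated = merge_env(existing, SGLANG_ENV)
print(json.dumps(updated, indent=2))
```

To apply it for real, load the file with `json.load`, call `merge_env`, and write the result back with `json.dump`.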

Required: CLAUDE_CODE_ATTRIBUTION_HEADER=0 for prefix-cache reuse

<Note> **Set this whenever Claude Code routes through SGLang (or any non-Anthropic gateway).** Without it, multi-turn conversations re-prefill the whole history every turn. </Note>

Claude Code prepends a per-request attribution block to the start of the system prompt, of the form x-anthropic-billing-header: cc_version=<ver>.<per-request-hash>; cc_entrypoint=...; cch=<hash>;. The per-request hash changes on every request and sits at the very start of the prompt, so it is the first token to differ between turns. The radix prefix cache can therefore reuse only the few tokens before that hash, and the server re-prefills the system prompt plus the entire conversation history on every turn.

Setting CLAUDE_CODE_ATTRIBUTION_HEADER=0 removes the whole attribution line from the system prompt. This is a documented Claude Code env var whose explicit purpose is to "improve prompt-cache hit rates when routing through an LLM gateway" (see the Claude Code env-vars reference).
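
The effect on prefix reuse can be illustrated with a toy model. This sketch treats each list element as one "token" and measures the longest shared prefix between two consecutive turns; it is not SGLang's radix cache, and `build_prompt`, the hash strings, and the history items are made-up stand-ins.

```python
def common_prefix_len(a, b):
    """Number of leading items shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(history, attribution_hash=None):
    """Toy prompt: optional attribution line, then system prompt, then history."""
    header = []
    if attribution_hash is not None:
        header = ["x-anthropic-billing-header:", f"cc_version=<ver>.{attribution_hash};"]
    return header + ["<system prompt>"] + history


turn1_history = ["user-1"]
turn2_history = ["user-1", "assistant-1", "user-2"]

# Attribution header on: the hash differs per request, so sharing stops there.
with_header = common_prefix_len(
    build_prompt(turn1_history, "h1"), build_prompt(turn2_history, "h2")
)

# Attribution header off: all of turn 1 is a prefix of turn 2.
without_header = common_prefix_len(
    build_prompt(turn1_history), build_prompt(turn2_history)
)

print(with_header, without_header)  # 1 2
```

With the header present only the single token before the hash is shared; without it the whole previous turn is reusable, which is the behaviour a prefix cache needs.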

<Note> `CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC` does **not** remove the attribution block — it only covers autoupdater/telemetry/error reporting. The attribution header is a separate code path; use `CLAUDE_CODE_ATTRIBUTION_HEADER=0` for it. </Note>

Troubleshooting

Connection refused / fetch failed — Ensure the server is up and the port in ANTHROPIC_BASE_URL matches --port (default 30000). If you set ANTHROPIC_BASE_URL to a remote host, confirm it's reachable and not behind a proxy that blocks the connection.

Model not found / 404 from the server — SGLang does not validate the request model field and serves whatever model was loaded at startup, so a 404 usually means the request did not reach the /v1/messages route at all. Confirm ANTHROPIC_BASE_URL points at the server (not missing the port) and that the server finished loading.

Tool calls not working / returned as raw text — Launch the server with the correct --tool-call-parser for your model (e.g. glm47, qwen3). Without it the tools field is still accepted but the model's tool calls come back as text instead of tool_use blocks, and Claude Code cannot execute them.

Slow / re-prefills the whole history every turn — You are missing CLAUDE_CODE_ATTRIBUTION_HEADER=0. Claude Code's per-request attribution hash in the system prompt defeats radix prefix-cache reuse; see the section above.

Context capped below 1M — The model name must end in [1m] for Claude Code to enable its 1M-context beta. Verify ANTHROPIC_DEFAULT_*_MODEL uses the [1m] suffix, and that the loaded model's native context is 1M (GLM-5.2 is 1048576; pass --context-length only to cap it, not to extend).
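
Several of the issues above come down to a mis-set environment variable, so a quick pre-flight check can save time. The function below is a hypothetical sketch, not a tool shipped with SGLang or Claude Code. It encodes only the rules stated in this document, plus one assumption: that `ANTHROPIC_BASE_URL` should be the server root without a `/v1` suffix, as with the SDK's `base_url`.

```python
def check_env(env):
    """Return a list of human-readable problems found in a Claude Code env mapping."""
    problems = []

    base = env.get("ANTHROPIC_BASE_URL", "")
    if not base:
        problems.append("ANTHROPIC_BASE_URL is not set")
    elif base.rstrip("/").endswith("/v1"):
        # Assumption: the client appends /v1/messages itself.
        problems.append("ANTHROPIC_BASE_URL should be the server root, without /v1")

    if not env.get("ANTHROPIC_AUTH_TOKEN"):
        problems.append("ANTHROPIC_AUTH_TOKEN must be a non-empty string")

    if env.get("CLAUDE_CODE_ATTRIBUTION_HEADER") != "0":
        problems.append("CLAUDE_CODE_ATTRIBUTION_HEADER=0 is needed for prefix-cache reuse")

    for tier in ("HAIKU", "SONNET", "OPUS"):
        name = env.get(f"ANTHROPIC_DEFAULT_{tier}_MODEL", "")
        if not name.endswith("[1m]"):
            problems.append(f"ANTHROPIC_DEFAULT_{tier}_MODEL lacks the [1m] suffix")

    return problems


good = {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:30000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]",
}
bad = dict(good, ANTHROPIC_BASE_URL="http://127.0.0.1:30000/v1", ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2")
del bad["CLAUDE_CODE_ATTRIBUTION_HEADER"]

print(check_env(good))
for problem in check_env(bad):
    print("-", problem)
```

To check the current shell, call `check_env(dict(os.environ))` after `import os`.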

Parameters

The /v1/messages endpoint accepts the standard Anthropic Messages API parameters. Refer to the Anthropic Messages API reference for the full list.

Reasoning models are supported through the same --reasoning-parser mechanism as the OpenAI-compatible endpoint; pass the model's reasoning kwarg via the request (e.g. thinking for DeepSeek-V3-style models, enable_thinking for Qwen3-style models). See OpenAI APIs - Completions for the reasoning-parser / chat-template mapping.