docs_new/cookbook/autoregressive/InclusionAI/Ling-2.6.mdx
The Ling-2.6 family from inclusionAI is the next iteration of the Ling instant-model series. Continuing the architectural direction set by Ling-2.5, Ling-2.6 doubles down on inference efficiency, token efficiency, and agent performance — staying competitive with frontier instant models while being faster, leaner, and better suited for production agent workloads.
Key Features:
- A 1:7 MLA + Lightning Linear hybrid attention stack built on top of a highly sparse MoE backbone.
- Compared with same-class SOTA models, Ling-2.6-flash shows up to ~4× higher prefill and decode throughput in long-context scenarios.
- Ling-2.6-1T is shipped in FP8, so it fits a single GB300 node with --tp 4.

Available Models:

- Ling-2.6-flash: 104B total / 7.4B active parameters (MoE)
- Ling-2.6-1T: 1T-scale MoE, shipped in FP8 (E4M3)
License: MIT
SGLang offers multiple installation methods. You can choose the most suitable installation method based on your hardware platform and requirements.
Please refer to the official SGLang installation guide for installation instructions.
Ling-2.6-flash is a 104B/7.4B-active MoE that runs comfortably on a single 4-GPU node. Use the selector below to generate the launch command for your hardware.
import { Ling26FlashDeployment } from '/src/snippets/autoregressive/ling-26-flash-deployment.jsx'
<Ling26FlashDeployment />

Notes:

- --trust-remote-code is required (custom BailingMoeV2_5ForCausalLM modeling code).
- --tp-size 4 is the reference layout. On 4× H20-3e the model reaches ~340 tokens/s decode at TP=4, batch 32.
- Use YaRN RoPE scaling (--json-model-override-args '{"rope_scaling": {"rope_type": "yarn", "factor": 2.0, ...}}') to extend the context window to 256K — the snippet does this for you.
- --tool-call-parser qwen25 matches the model's <tool_call>...</tool_call> schema.
- --reasoning-parser qwen3: Ling-2.6 is a controllable-reasoning model whose chat template defaults to detailed thinking off; the SGLang qwen3 reasoning parser, in contrast, assumes default-thinking semantics and would mis-route normal output into reasoning_content. Only enable it if you specifically want <think>...</think> blocks split out — see §4.3 Thinking Mode.
- Speculative decoding (MTP) is supported: add --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mamba-scheduler-strategy extra_buffer to enable it — see the model card for the full example.

Ling-2.6-1T ships in FP8 (E4M3), so unlike Ling-2.5-1T it fits a single GB300 node with --tp 4. On smaller GPUs (H200/B200), a 2-node deployment with --pp-size 2 is required.
import { Ling261TDeployment } from '/src/snippets/autoregressive/ling-26-1t-deployment.jsx'
<Ling261TDeployment />

Notes:

- --trust-remote-code is required for the custom modeling code.
- --model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}' significantly speeds up the multi-shard FP8 weight load (26 safetensors shards + an MTP layer).
- Use --tool-call-parser qwen for tool calling.
- --reasoning-parser qwen3: Ling-2.6's chat template defaults to detailed thinking off, while SGLang's qwen3 reasoning parser assumes default-thinking semantics — combining the two requires a per-request workaround for tool calls (see §4.3 Thinking Mode). Only enable --reasoning-parser qwen3 if you specifically want <think>...</think> blocks split into reasoning_content.
- For the 2-node deployment, set MASTER_IP, PORT, and DIST_PORT consistently across both nodes.

For example, launch a Ling-2.6-1T server on a single GB300 node:
sglang serve \
--model-path inclusionAI/Ling-2.6-1T \
--tp-size 4 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000 \
--tool-call-parser qwen \
--model-loader-extra-config '{"enable_multithread_load":"true","num_threads":64}'
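Once the server finishes loading, a quick sanity check is to list the served models through the OpenAI-compatible endpoint (a minimal sketch; the localhost address and port are assumptions matching the launch command above):

```python
# Sanity check against the OpenAI-compatible endpoint exposed by SGLang.
# Assumes the server launched above is listening on port 30000 of this host.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# List the models the server reports; the Ling-2.6 checkpoint should appear.
for model in client.models.list().data:
    print(model.id)
```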
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
Output:
{
"id": "...",
"object": "chat.completion",
"model": "auto",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is **Paris**.",
"reasoning_content": null,
"tool_calls": null
},
"finish_reason": "stop"
}
]
}
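The same request can be sent through the OpenAI Python SDK, since SGLang exposes an OpenAI-compatible endpoint (a minimal sketch; the host, port, and placeholder API key are assumptions matching the launch command above):

```python
from openai import OpenAI

# Point the OpenAI client at the SGLang server (adjust host/port to your deployment).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```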
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Search for the latest news about AI"}],
"tools": [{
"type": "function",
"function": {
"name": "search",
"description": "Search for information on the internet",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "The search query"}
},
"required": ["query"]
}
}
}],
"tool_choice": "auto"
}'
Output:
{
"choices": [
{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_...",
"type": "function",
"function": {
"name": "search",
"arguments": "{\"query\": \"latest news about AI\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
]
}
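To complete the loop, execute the returned tool call client-side and send its result back as a tool message; the model then answers from the tool output. A minimal sketch with the OpenAI Python SDK, where run_search is a hypothetical stand-in for a real search backend:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search for information on the internet",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "The search query"}},
            "required": ["query"],
        },
    },
}]

def run_search(query: str) -> str:
    # Hypothetical stand-in for a real search backend.
    return f"Top result for '{query}': ..."

messages = [{"role": "user", "content": "Search for the latest news about AI"}]

# First turn: the model decides to call the search tool.
first = client.chat.completions.create(model="auto", messages=messages, tools=tools)
assistant_msg = first.choices[0].message
# Append the assistant message (including its tool_calls) back into the history.
messages.append(assistant_msg)

# Execute each requested tool call and append the result as a tool message.
for call in assistant_msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": run_search(**args),
    })

# Second turn: the model answers using the tool output.
final = client.chat.completions.create(model="auto", messages=messages, tools=tools)
print(final.choices[0].message.content)
```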
Both Ling-2.6-flash and Ling-2.6-1T are controllable-reasoning models. Their chat template uses textual directives in the system message — detailed thinking on or detailed thinking off — to toggle thinking. The template defaults to detailed thinking off when neither phrase is present, and it does not read the Qwen3-style enable_thinking template variable.
Include detailed thinking on in the first system message:
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{"role": "system", "content": "detailed thinking on"},
{"role": "user", "content": "If a box has 12 red balls and 8 blue balls, then 5 red balls are removed, how many balls remain?"}
]
}'
If you already have a system prompt, append the directive on its own line:
{"role": "system", "content": "You are a helpful assistant.\ndetailed thinking on"}
When thinking is on, the model emits <think>...</think> blocks before its final answer. To get those split into message.reasoning_content automatically, also launch the server with --reasoning-parser qwen3.
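If you leave the reasoning parser off (the simpler configuration, especially alongside tool calling), the <think>...</think> block stays inline in message.content and can be separated client-side. A small post-processing sketch, assuming at most one think block at the start of the reply:

```python
import re

# Split a Ling-2.6 reply into (reasoning, answer) when the server runs
# without --reasoning-parser qwen3, so <think>...</think> stays in content.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(content: str) -> tuple[str | None, str]:
    match = THINK_RE.search(content)
    if match is None:
        return None, content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer

# Example with a fabricated reply for illustration only.
reasoning, answer = split_thinking(
    "<think>12 + 8 = 20 balls; removing 5 red leaves 15.</think>There are 15 balls left."
)
print(reasoning)
print(answer)
```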
--reasoning-parser qwen3 + tool calling

The SGLang qwen3 reasoning parser was written for Qwen3, where models are default-thinking and clients opt out via chat_template_kwargs.enable_thinking=false. Ling-2.6 is the opposite — default-non-thinking, with toggling done in the system message. As a result, when the server is launched with both --tool-call-parser qwen and --reasoning-parser qwen3, every tool-call request must include chat_template_kwargs.enable_thinking=false, otherwise the parser routes the <tool_call>...</tool_call> block into reasoning_content instead of message.tool_calls:
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Search for the latest news about AI"}],
"tools": [...],
"tool_choice": "auto",
"chat_template_kwargs": {"enable_thinking": false}
}'
enable_thinking here is consumed by the SGLang reasoning parser, not by the chat template — Ling-2.6's template ignores it. For the simplest configuration, just omit --reasoning-parser qwen3 and toggle thinking via the system message.
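With the OpenAI Python SDK, the same workaround goes through extra_body, since chat_template_kwargs is not a first-class argument of the SDK (a sketch; the tool schema is the same hypothetical search function used earlier):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{"type": "function", "function": {
    "name": "search",
    "description": "Search for information on the internet",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]},
}}]

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Search for the latest news about AI"}],
    tools=tools,
    tool_choice="auto",
    # Consumed by SGLang's qwen3 reasoning parser, not by Ling-2.6's chat template;
    # without it the parser routes the <tool_call> block into reasoning_content.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.tool_calls)
```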
For more API examples, see the SGLang Basic Usage Guide.
Reference run on a single GB300 node with --tp 4:
python3 benchmark/gsm8k/bench_sglang.py
Accuracy: 0.9621 (1269 / 1319)
For Ling-2.6-flash, see the official numbers on the model card (BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, PinchBench, Artificial Analysis).