docs_new/cookbook/autoregressive/Poolside/Laguna-M.1.mdx
Laguna-M.1 support is already on SGLang main but not yet in a tagged release. It consists of softplus per-element attention-output gating (PR #28400) and a global-attention fix (PR #28604, needed because M.1 is a full-attention model with sliding_window: 0). The two paths below match the Python / Docker toggle in the command panel: install from main (Python tab), or use the Docker image, which bundles the same build (CUDA 13, covers H200 and all Blackwell). The model ships custom config code on the Hub, so --trust-remote-code is required (it is included in the launch commands).
pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate
# Laguna-M.1 support is on SGLang main (PRs #28400 + #28604, plus #28649 for FP8), not yet in a
# tagged release — install from main. The serving runtime is in the base dependencies, no extra needed:
git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e python
Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image (dev-cu13-618-nightly) bundles the CUDA-13 runtime and the M.1 code. Once M.1 support lands in a tagged release, uv pip install sglang will pull it directly.
# Pinned nightly with the Laguna-M.1 build (PR #28400 + #28604; CUDA 13 — covers H200 + all Blackwell):
docker pull lmsysorg/sglang:dev-cu13-618-nightly
For how to launch the image, see Install → Method 3: Using Docker. Substitute the inner sglang serve ... with what the command generator below produces.
Pick your hardware + quantization to generate the launch command. Laguna-M.1 ships a single Balanced recipe per cell — poolside's recommended operating point, a good speed/throughput trade-off for typical multi-user serving. The 8-GPU HGX platforms (H200 / B200 / B300) use --tp 8; the 4-GPU Grace-Blackwell single nodes (GB200 / GB300) use --tp 4.
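As a rough illustration of what the command generator assembles, the sketch below maps a platform and quantization to a baseline launch command. The `build_launch_command` helper and both lookup tables are hypothetical names introduced here; the values come from this guide (TP sizes above, Hugging Face ids from the quantization table). It does not check whether a given platform/quantization cell is actually verified, and it omits any per-cell tuning flags, so treat the Deploy panel as the source of truth.

```python
import shlex

# TP sizes stated in this guide: 8-GPU HGX nodes use --tp 8,
# 4-GPU Grace-Blackwell single nodes use --tp 4.
TP_BY_PLATFORM = {"H200": 8, "B200": 8, "B300": 8, "GB200": 4, "GB300": 4}

# Hugging Face ids listed in this guide's quantization table.
MODEL_BY_QUANT = {
    "BF16": "poolside/Laguna-M.1",
    "FP8": "poolside/Laguna-M.1-FP8",
    "NVFP4": "poolside/Laguna-M.1-NVFP4",
}


def build_launch_command(platform: str, quant: str) -> list[str]:
    """Hypothetical helper: assemble a baseline `sglang serve` argv.

    Assumption: only the flags named in this guide are emitted; the real
    generator may add more per-cell flags.
    """
    return [
        "sglang", "serve",
        "--model-path", MODEL_BY_QUANT[quant],
        "--trust-remote-code",  # required: the model ships custom config code
        "--tp", str(TP_BY_PLATFORM[platform]),
    ]


if __name__ == "__main__":
    print(shlex.join(build_launch_command("GB200", "NVFP4")))
```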
import { Deployment } from "/src/snippets/_deployment.jsx"; import { config } from "/src/snippets/configs/poolside/laguna-m1.jsx"; import { benchmarks } from "/src/snippets/configs/poolside/laguna-m1-benchmarks.jsx";
<Deployment config={config} benchmarks={benchmarks} />
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs (parsers, DP-Attention, DeepEP / EP) on top of whichever cell the Deploy panel is currently showing.
import { Playground } from "/src/snippets/_playground.jsx";
<Playground config={config} />
Laguna-M.1 is an open-weight, 225B-parameter Mixture-of-Experts model (23B activated per token) from poolside, built for agentic coding and long-horizon software-engineering work. It is released under Apache 2.0.
Key Features:
chat_template_kwargs={"enable_thinking": ...}.
Available Quantizations:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "20%"}} /> <col style={{width: "80%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Quantization</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Hugging Face path</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><strong>BF16</strong></td> <td style={{padding: "9px 12px"}}>[`poolside/Laguna-M.1`](https://huggingface.co/poolside/Laguna-M.1)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><strong>FP8</strong></td> <td style={{padding: "9px 12px"}}>[`poolside/Laguna-M.1-FP8`](https://huggingface.co/poolside/Laguna-M.1-FP8)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><strong>NVFP4</strong></td> <td style={{padding: "9px 12px"}}>[`poolside/Laguna-M.1-NVFP4`](https://huggingface.co/poolside/Laguna-M.1-NVFP4)</td> </tr> </tbody> </table>
License: Apache 2.0
Resources: Hugging Face · Release blog post · Technical report · API platform.
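For a sense of scale across the three quantizations, here is a back-of-the-envelope estimate of weight memory from the 225B parameter count. The bytes-per-parameter values are nominal assumptions (2 for BF16, 1 for FP8, 0.5 for NVFP4); quantization scales, layers kept in higher precision, the KV cache, and activations are all ignored, so real footprints will be higher.

```python
TOTAL_PARAMS = 225e9  # total parameter count stated in this guide

# Nominal storage per weight. Assumption: scale tensors and any layers
# left unquantized are ignored, so these are lower bounds.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(quant: str) -> float:
    """Approximate total weight size in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[quant] / 1e9


def per_gpu_weight_gb(quant: str, tp: int) -> float:
    """Approximate weight size per GPU, assuming weights shard evenly over TP ranks."""
    return weight_gb(quant) / tp


if __name__ == "__main__":
    for quant in BYTES_PER_PARAM:
        print(f"{quant}: ~{weight_gb(quant):.0f} GB total, "
              f"~{per_gpu_weight_gb(quant, 8):.1f} GB per GPU at tp=8")
```

Whatever memory remains after the weights is what the KV cache and CUDA graphs compete for, which is where memory and context-length limits come into play.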
--trust-remote-code): Laguna-M.1 ships custom modeling/config code on the Hugging Face Hub, so this flag is required for the server to load the model.
--mem-fraction-static or cap --context-length.
--fp8-gemm-backend triton: the compressed-tensors block-FP8 weight scales aren't UE8M0-packed, so the default DeepGEMM path emits garbage on Blackwell (sm_100); the Triton backend is correct (~19% slower). This is a temporary workaround pending PR #28662 (which fixes the scales and restores the DeepGEMM fast path). On Hopper (H200), FP8 uses DeepGEMM with no extra flag; pre-warm its multi-session JIT with python3 -m sglang.compile_deep_gemm --model poolside/Laguna-M.1-FP8 to avoid paying it on each restart.
poolside_v1): for agentic / tool-using deployments, enable the Reasoning Parser and Tool Call Parser in the Playground above. They emit --reasoning-parser poolside_v1 (thinking → reasoning_content) and --tool-call-parser poolside_v1 (structured tool_calls).
extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
--model-path, so a client's model field must match it: poolside/Laguna-M.1 (BF16) or poolside/Laguna-M.1-FP8 / -NVFP4 for the quantized cells. The §3 examples use the BF16 id; swap in the id you launched.
temperature=1.0, top_k=20 with thinking enabled. These are per-request sampling params (not launch flags), e.g. temperature=1.0, extra_body={"top_k": 20} on the OpenAI client.
Launch with --reasoning-parser poolside_v1 (or toggle Reasoning Parser in the Parsers card of the Playground above). Reasoning is opt-in: the Laguna chat template gates it on enable_thinking=True (passed via chat_template_kwargs); the generic thinking key is ignored. The <think> trace then lands in message.reasoning_content, separate from the final answer in message.content, so no client-side tag stripping is needed.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="poolside/Laguna-M.1",
messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
max_tokens=2048,
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)
=============== Reasoning ===============
Okay, so I need to find out what 15% of 240 is. Hmm, percentages can sometimes be
tricky, but let me think. I remember that "percent" means per hundred, right? So 15%
is the same as 15 per 100 or 15/100. Maybe I can convert that percentage into a decimal
first? ... 15 divided by 100 is 0.15. ... Now, to find 15% of 240, I just need to
multiply 240 by 0.15. ... 240 times 0.1 is 24 (10% of 240), and 240 times 0.05 is 12
(half of that), so 24 + 12 = 36.
[… verifies the same result several more ways: 15/100 × 240, 240 × 15 ÷ 100,
1% × 15, and the fraction 3/20 × 240 — all give 36 …]
So ... all methods are pointing to 36. I'm pretty confident that 15% of 240 is 36.
=============== Answer ==================
To find 15% of 240, convert the percentage to a decimal (0.15) and multiply by 240:
**240 × 0.15 = 36**.
**Step-by-Step Explanation:**
1. **Convert 15% to a decimal:** 15% = 15/100 = 0.15.
2. **Multiply by 240:**
- Break it down:
- 10% of 240 = 24 (since 240 × 0.1 = 24).
- 5% of 240 = 12 (half of 24).
- Add them: 24 + 12 = **36**.
**Answer:** 15% of 240 is **36**.
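The recommended sampling parameters from the configuration tips (temperature=1.0, top_k=20 with thinking enabled) can be combined with the thinking toggle in a single request. A minimal sketch follows; `thinking_request` is a hypothetical helper name, and the only non-standard part is that `top_k` and `chat_template_kwargs` travel together inside `extra_body`, since they are not standard OpenAI client arguments.

```python
def thinking_request(model: str, prompt: str, max_tokens: int = 2048) -> dict:
    """Hypothetical helper: build kwargs for client.chat.completions.create().

    temperature is a standard OpenAI argument; top_k and
    chat_template_kwargs are passed through extra_body.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "extra_body": {
            "top_k": 20,
            "chat_template_kwargs": {"enable_thinking": True},
        },
    }


if __name__ == "__main__":
    kwargs = thinking_request("poolside/Laguna-M.1", "What is 15% of 240? Explain briefly.")
    print(kwargs["extra_body"])
```

With a running server, the result would be passed as `client.chat.completions.create(**kwargs)` using the same `OpenAI` client as in the example above.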
Launch with --tool-call-parser poolside_v1 (or toggle Tool Call Parser in the Parsers card of the Playground above). The parser converts Laguna's <tool_call> output into the standard OpenAI tool_calls structure. Tool calling works with reasoning off (enable_thinking=False, the default).
from openai import OpenAI
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "The city name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]
response = client.chat.completions.create(
model="poolside/Laguna-M.1",
messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
tools=tools,
)
message = response.choices[0].message
if message.tool_calls:
for call in message.tool_calls:
print(f"Tool: {call.function.name}")
print(f"Args: {call.function.arguments}")
Tool: get_weather
Args: {"location": "Beijing"}
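The example above stops at printing the tool call. In a typical OpenAI-style loop, the client then executes the tool and sends the result back as a `tool` role message so the model can produce a final answer. The sketch below shows that step only; `get_weather` here is a stand-in that returns a placeholder value, and `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, not part of SGLang or the OpenAI client.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Stand-in tool. The temperature is a placeholder, not real data."""
    return {"location": location, "unit": unit, "temperature": 21}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and wrap the result as a `tool` role message."""
    args = json.loads(arguments)  # arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


if __name__ == "__main__":
    # Mirrors the call printed above; "call_0" is a made-up id for illustration.
    print(run_tool_call("call_0", "get_weather", '{"location": "Beijing"}'))
```

In a real loop, `call.id`, `call.function.name`, and `call.function.arguments` from each entry in `message.tool_calls` would be passed in, and the returned message appended to `messages` (after the assistant message that contained the tool calls) before the next `create` call.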
PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. Laguna-M.1 is global-attention with a standard KV cache (no sliding window, no sparse "index" side-buffer), so its KV pages transfer with no model-specific flags — just the --disaggregation-* knobs. Both roles auto-select the same attention backend (FlashAttention-3) and page size because they share the model and flags, so the KV layout lines up for transfer.
Supported / validated topology:
--tp.
Launch the prefill server, then the decode server: the same recipe with --disaggregation-mode decode and no bootstrap port. Point --disaggregation-ib-device at your RDMA NIC(s).
sglang serve \
--model-path poolside/Laguna-M.1 \
--trust-remote-code \
--reasoning-parser poolside_v1 \
--tool-call-parser poolside_v1 \
--tp 8 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
--host 0.0.0.0 --port 30000 \
--disaggregation-bootstrap-port 8998
sglang serve \
--model-path poolside/Laguna-M.1 \
--trust-remote-code \
--reasoning-parser poolside_v1 \
--tool-call-parser poolside_v1 \
--tp 8 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
--host 0.0.0.0 --port 30001
Then start the PD router, pointing it at the prefill bootstrap (URL plus its --disaggregation-bootstrap-port) and the decode endpoint:
python3 -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://<prefill-host>:30000 8998 \
--decode http://<decode-host>:30001 \
--policy round_robin \
--host 0.0.0.0 --port 8000
Clients hit the router exactly like a single server — it splits each request across the two stages transparently:
<Accordion title="PD Client Example (Python)">
from openai import OpenAI
client = OpenAI(base_url="http://<router-host>:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="poolside/Laguna-M.1",
messages=[{"role": "user", "content": "What is 2 + 2?"}],
max_tokens=64,
)
print(response.choices[0].message.content)
Output Example:
2 + 2 = 4
</Accordion>
Transfer backend — mooncake (recommended). mooncake honors --disaggregation-ib-device and establishes its RDMA connection at registration, so the first request is already fast (no cold start). It works with a single NIC or all eight; using all 8 NICs lowers TTFT (more aggregate bandwidth for the KV payload — the gap widens at longer context). On 8×H200 (random isl=512 / osl=256, 16 concurrent) it served ≈ 717 tok/s output (≈ 2.2k tok/s total), mean TTFT 244 ms, mean TPOT 17.7 ms; with a single mlx5_0 NIC, ≈ 697 tok/s and TTFT 287 ms (TPOT unchanged — decode is compute-bound).
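The output and total throughput figures above are consistent with each other if "total" counts input plus output tokens, which is an assumption about how the benchmark reports it. The short check below reproduces that relation and the relative TTFT cost of using a single NIC, using only the numbers quoted in the paragraph above.

```python
def total_throughput(output_tok_s: float, isl: int, osl: int) -> float:
    """Total tok/s, assuming total = input + output tokens per request."""
    return output_tok_s * (isl + osl) / osl


def relative_increase(baseline: float, value: float) -> float:
    """Fractional increase of `value` over `baseline`."""
    return (value - baseline) / baseline


if __name__ == "__main__":
    # 8 NICs: 717 tok/s output at isl=512 / osl=256
    print(f"total ~ {total_throughput(717, 512, 256):.0f} tok/s")
    # TTFT: 244 ms with 8 NICs vs 287 ms with a single NIC
    print(f"single-NIC TTFT is {relative_increase(244, 287):.1%} higher")
```

The first figure comes out at about 2.15k tok/s, matching the ≈ 2.2k quoted above; the TTFT gap is roughly 18% at this short input length.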
Transfer backend — NIXL (works, with two caveats).
<Warning> The NIXL path **ignores `--disaggregation-ib-device`**; that flag is mooncake-only. NIXL uses its UCX backend, whose NIC is selected by the **`UCX_NET_DEVICES`** environment variable. **Set it** (e.g. `export UCX_NET_DEVICES=mlx5_0:1`) on both servers; without it UCX cannot establish a working cross-node path and every KV transfer hangs until it hits the 300 s timeout (`Request … timed out … in KVPoll.WaitingForInput`) and returns a 500. </Warning>
With UCX_NET_DEVICES pinned, NIXL matches mooncake on quality and steady-state speed (≈ 720 tok/s, TTFT 230 ms, TPOT 17.7 ms). One difference: the first request after launch pays a ~38 s one-time UCX connection cold-start (a single port or all eight behave the same). Warm the path with one throwaway request after startup, or raise SGLANG_DISAGGREGATION_WAITING_TIMEOUT (default 300 s) so the first real request isn't dropped while UCX connects.
Validation. PD disaggregation preserves output quality — disaggregated output matches non-disaggregated serving, and GSM8K (no-thinking, 200-question subset via the router) scored 0.945 (mooncake, 8 NICs) / 0.940 (NIXL) / 0.950 (mooncake, 1 NIC), all with 100% stop-rate and 0% errors — in line with single-node BF16 (≈ 0.93 on the full split). Logs confirm the split: the prefill node logs Prefill batch (CUDA graph off), the decode node logs Decode batch (CUDA graph on).
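To judge whether the three GSM8K scores differ meaningfully, it helps to look at the sampling noise of a 200-question subset. The sketch below uses the usual binomial standard error, which assumes questions are independent draws; that modeling choice is an assumption made here, not part of the reported validation.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


# Scores reported above (200-question subset each).
SCORES = {"mooncake, 8 NICs": 0.945, "NIXL": 0.940, "mooncake, 1 NIC": 0.950}

if __name__ == "__main__":
    for name, p in SCORES.items():
        print(f"{name}: {p:.3f} +/- {accuracy_stderr(p, 200):.3f}")
    spread = max(SCORES.values()) - min(SCORES.values())
    print(f"spread between configurations: {spread:.3f}")
```

Under this assumption one standard error is about 0.016, larger than the 0.010 spread between the three configurations, so the differences between transfer backends are within noise.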