Laguna-M.1


Deployment

<a id="install" /> <Accordion title="Install SGLang">

Laguna-M.1 support is already on SGLang main but not yet in a tagged release. It consists of softplus per-element attention-output gating (PR #28400) and a global-attention fix (PR #28604, needed because M.1 is full-attention, `sliding_window: 0`). The two paths below match the Python / Docker toggle in the command panel: install from main (Python tab), or use the Docker image, which bundles the same build (CUDA 13, covers H200 and all Blackwell). The model ships custom config code on the Hub, so `--trust-remote-code` is required (it is included in the launch commands).

<Tabs> <Tab title="Python (pip / uv)">
```bash
pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate

# Laguna-M.1 support is on SGLang main (PRs #28400 + #28604, plus #28649 for FP8), not yet in a
# tagged release, so install from main. The serving runtime is in the base dependencies, no extra needed:
git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e python
```

Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image (dev-cu13-618-nightly) bundles the CUDA-13 runtime and the M.1 code. Once M.1 support lands in a tagged release, uv pip install sglang will pull it directly.

</Tab> <Tab title="Docker">
```bash
# Pinned nightly with the Laguna-M.1 build (PR #28400 + #28604; CUDA 13, covers H200 + all Blackwell):
docker pull lmsysorg/sglang:dev-cu13-618-nightly
```

For how to launch the image, see Install → Method 3: Using Docker. Substitute the inner sglang serve ... with what the command generator below produces.

</Tab> </Tabs> </Accordion>

Pick your hardware and quantization to generate the launch command. Laguna-M.1 ships a single Balanced recipe per cell: poolside's recommended operating point, a good latency/throughput trade-off for typical multi-user serving. The 8-GPU HGX platforms (H200 / B200 / B300) use --tp 8; the 4-GPU Grace-Blackwell single nodes (GB200 / GB300) use --tp 4.

import { Deployment } from "/src/snippets/_deployment.jsx";
import { config } from "/src/snippets/configs/poolside/laguna-m1.jsx";
import { benchmarks } from "/src/snippets/configs/poolside/laguna-m1-benchmarks.jsx";

<Deployment config={config} benchmarks={benchmarks} />

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs (parsers, DP-Attention, DeepEP / EP) on top of whichever cell the Deploy panel is currently showing.

import { Playground } from "/src/snippets/_playground.jsx";

<Playground config={config} />

1. Model Introduction

Laguna-M.1 is an open-weight, 225B-parameter Mixture-of-Experts model (23B activated per token) from poolside, built for agentic coding and long-horizon software-engineering work. It is released under Apache 2.0.

Key Features:

  • Large sparse MoE: 70-layer transformer — the first 3 layers are dense SwiGLU, the remaining 67 are sparse MoE with 256 experts, top-16 routing (+1 shared expert) and auxiliary-loss-free load balancing.
  • Global attention with output gating: global attention across all layers, 64 Q-heads / 8 KV-heads (head dim 128), with softplus attention output gating (requires PR #28400).
  • Long context: 262,144 tokens, RoPE with YaRN.
  • Agentic coding: competitive on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0.
  • Native reasoning: interleaved thinking between tool calls, toggled per request via chat_template_kwargs={"enable_thinking": ...}.
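
The attention shape listed above determines how much KV cache a long request consumes. The following back-of-the-envelope sketch assumes a BF16 (2-byte) KV cache, that all 70 layers store keys and values, and that the 8 KV heads are split evenly across tensor-parallel ranks. These are assumptions for illustration; real usage depends on SGLang's allocator and on any KV-cache quantization you enable.

```python
# Rough KV-cache sizing from the architecture figures listed above.
# Assumptions (not from the model card): BF16 cache entries, every layer
# stores K and V, KV heads split evenly across tensor-parallel ranks.
NUM_LAYERS = 70
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # BF16
MAX_CONTEXT = 262_144


def kv_bytes_per_token(tp_size: int = 1) -> int:
    """Bytes of K+V stored per cached token on one tensor-parallel rank."""
    heads_per_rank = max(1, NUM_KV_HEADS // tp_size)
    return 2 * NUM_LAYERS * heads_per_rank * HEAD_DIM * BYTES_PER_ELEMENT


def kv_gib(tokens: int, tp_size: int = 1) -> float:
    """GiB of KV cache needed for `tokens` cached tokens on one rank."""
    return tokens * kv_bytes_per_token(tp_size) / 2**30


print(f"per token, all ranks combined: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"one full-context sequence, total: {kv_gib(MAX_CONTEXT):.2f} GiB")
print(f"one full-context sequence, per GPU at --tp 8: {kv_gib(MAX_CONTEXT, 8):.2f} GiB")
```

Under these assumptions a single 262,144-token sequence needs about 70 GiB of KV cache in total, or 8.75 GiB per GPU at --tp 8, which is why long-context serving is sensitive to the memory settings.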

Available Quantizations:

<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "20%"}} /> <col style={{width: "80%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Quantization</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700}}>Hugging Face path</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><strong>BF16</strong></td> <td style={{padding: "9px 12px"}}>[`poolside/Laguna-M.1`](https://huggingface.co/poolside/Laguna-M.1)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><strong>FP8</strong></td> <td style={{padding: "9px 12px"}}>[`poolside/Laguna-M.1-FP8`](https://huggingface.co/poolside/Laguna-M.1-FP8)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500}}><strong>NVFP4</strong></td> <td style={{padding: "9px 12px"}}>[`poolside/Laguna-M.1-NVFP4`](https://huggingface.co/poolside/Laguna-M.1-NVFP4)</td> </tr> </tbody> </table>
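
For a first-order sense of how the three checkpoints compare in memory, the sketch below multiplies the 225B parameter count by a nominal bytes-per-parameter figure for each format. The per-format byte counts are simplifying assumptions: they ignore quantization scales and any layers kept in higher precision, so treat the results as lower bounds, not measured sizes.

```python
# Nominal weight footprint per quantization (a rough estimate only).
# Assumption: 2 / 1 / 0.5 bytes per parameter, ignoring scale tensors and
# any layers a quantized checkpoint keeps in higher precision.
TOTAL_PARAMS = 225e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(quant: str, tp_size: int = 1) -> float:
    """Approximate weight size in decimal GB held by one tensor-parallel rank."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[quant] / tp_size / 1e9


for quant in BYTES_PER_PARAM:
    print(f"{quant}: ~{weight_gb(quant):.1f} GB total, ~{weight_gb(quant, 8):.1f} GB per GPU at --tp 8")
```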

License: Apache 2.0

Resources: Hugging Face · Release blog post · Technical report · API platform.

2. Configuration Tips

  • Trust remote code (--trust-remote-code): Laguna-M.1 ships custom modeling/config code on the Hugging Face Hub, so this flag is required for the server to load the model.
  • Long-context memory: M.1 is global-attention (no sliding-window), so the 262,144-token KV cache is large. If you hit OOM at full context, lower --mem-fraction-static or cap --context-length.
  • FP8: On Blackwell the recipe adds --fp8-gemm-backend triton. The compressed-tensors block-FP8 weight scales aren't UE8M0-packed, so the default DeepGEMM path produces incorrect output on Blackwell (sm_100); the Triton backend is correct but about 19% slower. This is a temporary workaround pending PR #28662, which fixes the scales and restores the DeepGEMM fast path. On Hopper (H200), FP8 uses DeepGEMM with no extra flag; pre-warm its JIT compilation with python3 -m sglang.compile_deep_gemm --model poolside/Laguna-M.1-FP8 so the compile cost is not paid on each restart.
  • Parsers (poolside_v1): for agentic / tool-using deployments enable the Reasoning Parser and Tool Call Parser in the Playground above — they emit --reasoning-parser poolside_v1 (thinking → reasoning_content) and --tool-call-parser poolside_v1 (structured tool_calls).
  • Thinking default: thinking is off by default; opt in per request with extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
  • Served model id: the server registers the model under whatever you pass to --model-path, so a client's model field must match it — poolside/Laguna-M.1 (BF16) or poolside/Laguna-M.1-FP8 / -NVFP4 for the quantized cells. The §3 examples use the BF16 id; swap in the id you launched.
  • Recommended sampling: poolside benchmarks M.1 at temperature=1.0, top_k=20 with thinking enabled. These are per-request sampling params (not launch flags) — e.g. temperature=1.0, extra_body={"top_k": 20} on the OpenAI client.
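
The last three tips (thinking toggle, served model id, sampling) all end up in the request body, so they can be combined in one place. Below is a minimal sketch; `build_request` and `MODEL_IDS` are hypothetical helper names introduced for illustration, and the mapping assumes each server was launched with the matching Hugging Face path as --model-path.

```python
from typing import Any, Dict

# Served model ids, assuming --model-path was set to the Hugging Face path.
MODEL_IDS = {
    "BF16": "poolside/Laguna-M.1",
    "FP8": "poolside/Laguna-M.1-FP8",
    "NVFP4": "poolside/Laguna-M.1-NVFP4",
}


def build_request(prompt: str, quant: str = "BF16", thinking: bool = True) -> Dict[str, Any]:
    """Keyword arguments for client.chat.completions.create (hypothetical helper)."""
    return {
        "model": MODEL_IDS[quant],
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # recommended sampling temperature
        "extra_body": {
            "top_k": 20,  # not a standard OpenAI field, so it travels in extra_body
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


# Usage against a running server (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
#   response = client.chat.completions.create(**build_request("Fix the failing test."))
print(build_request("Fix the failing test.", quant="FP8"))
```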

3. Advanced Usage

3.1 Reasoning

Launch with --reasoning-parser poolside_v1 (or toggle Reasoning Parser in the Parsers card of the Playground above). Reasoning is opt-in: the Laguna chat template gates it on enable_thinking=True (passed via chat_template_kwargs) — the generic thinking key is ignored. The <think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.

<Accordion title="Reasoning Example (Python)">
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="poolside/Laguna-M.1",
    messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)
```
</Accordion> <Accordion title="Example Output">
```text
=============== Reasoning ===============
Okay, so I need to find out what 15% of 240 is. Hmm, percentages can sometimes be
tricky, but let me think. I remember that "percent" means per hundred, right? So 15%
is the same as 15 per 100 or 15/100. Maybe I can convert that percentage into a decimal
first? ... 15 divided by 100 is 0.15. ... Now, to find 15% of 240, I just need to
multiply 240 by 0.15. ... 240 times 0.1 is 24 (10% of 240), and 240 times 0.05 is 12
(half of that), so 24 + 12 = 36.
[… verifies the same result several more ways: 15/100 × 240, 240 × 15 ÷ 100,
1% × 15, and the fraction 3/20 × 240; all give 36 …]
So ... all methods are pointing to 36. I'm pretty confident that 15% of 240 is 36.
=============== Answer ==================
To find 15% of 240, convert the percentage to a decimal (0.15) and multiply by 240:
**240 × 0.15 = 36**.

**Step-by-Step Explanation:**
1. **Convert 15% to a decimal:** 15% = 15/100 = 0.15.
2. **Multiply by 240:**
   - Break it down:
     - 10% of 240 = 24 (since 240 × 0.1 = 24).
     - 5% of 240 = 12 (half of 24).
   - Add them: 24 + 12 = **36**.

**Answer:** 15% of 240 is **36**.
```
</Accordion> <Note> Laguna-M.1's reasoning traces are long — the model explores and re-verifies an answer multiple ways. Give it a generous `max_tokens` for harder problems (reasoning regularly exceeds 3k tokens). The trace above is abbreviated; the model emits it in full. </Note>

3.2 Tool Calling

Launch with --tool-call-parser poolside_v1 (or toggle Tool Call Parser in the Parsers card of the Playground above). The parser converts Laguna's <tool_call> output into the standard OpenAI tool_calls structure. Tool calling works with reasoning off (enable_thinking=False, the default).

<Accordion title="Tool Calling Example (Python)">
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="poolside/Laguna-M.1",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
```
</Accordion> <Accordion title="Example Output">
```text
Tool: get_weather
Args: {"location": "Beijing"}
```
</Accordion>
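
The example above stops once the model has requested a tool. To finish the turn, the client runs the tool and sends the result back as a `tool` message. Here is a minimal sketch of that step; `get_weather` is a hypothetical local implementation returning canned data, and `run_tool_calls` / `TOOL_REGISTRY` are illustrative names, not part of SGLang or the OpenAI client.

```python
import json
from types import SimpleNamespace
from typing import Dict, List


def get_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical tool implementation; returns canned data."""
    return {"location": location, "temperature": 22, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> List[Dict[str, str]]:
    """Execute each requested call and build OpenAI-style `tool` messages."""
    results = []
    for call in tool_calls:
        # function.arguments arrives as a JSON string, not a dict.
        args = json.loads(call.function.arguments or "{}")
        output = TOOL_REGISTRY[call.function.name](**args)
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results


# Offline demonstration with a stand-in for message.tool_calls[0].
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Beijing"}'),
)
print(run_tool_calls([fake_call]))
```

In a real loop, append the assistant `message` (including its `tool_calls`) and then these `tool` messages to `messages`, and call `client.chat.completions.create` again with the same `tools` to get the final answer.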

3.3 Prefill-Decode (PD) Disaggregation

PD disaggregation runs prefill and decode on separate SGLang servers linked by an RDMA KV-transfer fabric (mooncake or NIXL), fronted by the PD router. Laguna-M.1 is global-attention with a standard KV cache (no sliding window, no sparse "index" side-buffer), so its KV pages transfer with no model-specific flags — just the --disaggregation-* knobs. Both roles auto-select the same attention backend (FlashAttention-3) and page size because they share the model and flags, so the KV layout lines up for transfer.

Supported / validated topology:

  • Equal tensor parallelism — prefill and decode run the same --tp.
  • Single pipeline stage — PP = 1 (the default).
  • mooncake or NIXL transfer backend over RDMA / InfiniBand.
  • Validated on 2 × 8×H200 (TP8 prefill + TP8 decode, BF16), one node each, over an 8× 400 Gb/s NDR InfiniBand fabric.

Launch the prefill server, then the decode server — the same recipe with --disaggregation-mode decode and no bootstrap port. Point --disaggregation-ib-device at your RDMA NIC(s).

```bash
sglang serve \
  --model-path poolside/Laguna-M.1 \
  --trust-remote-code \
  --reasoning-parser poolside_v1 \
  --tool-call-parser poolside_v1 \
  --tp 8 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30000 \
  --disaggregation-bootstrap-port 8998
```

```bash
sglang serve \
  --model-path poolside/Laguna-M.1 \
  --trust-remote-code \
  --reasoning-parser poolside_v1 \
  --tool-call-parser poolside_v1 \
  --tp 8 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
  --host 0.0.0.0 --port 30001
```

Then start the PD router, pointing it at the prefill bootstrap (URL plus its --disaggregation-bootstrap-port) and the decode endpoint:

```bash
python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://<prefill-host>:30000 8998 \
  --decode http://<decode-host>:30001 \
  --policy round_robin \
  --host 0.0.0.0 --port 8000
```
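
Since the prefill and decode commands must share the model and flags for the KV layout to line up, it can help to generate all three commands from one definition. The sketch below does this using only the flags shown above; `server_cmd`, `router_cmd`, and `SHARED_FLAGS` are hypothetical helper names, and the host names are placeholders.

```python
import shlex
from typing import List, Optional

# Flags both roles must share so the KV layout matches on each side.
SHARED_FLAGS = [
    "--model-path", "poolside/Laguna-M.1",
    "--trust-remote-code",
    "--reasoning-parser", "poolside_v1",
    "--tool-call-parser", "poolside_v1",
    "--tp", "8",
    "--disaggregation-transfer-backend", "mooncake",
]


def server_cmd(mode: str, port: int, ib_devices: List[str],
               bootstrap_port: Optional[int] = None) -> List[str]:
    """argv for one PD role; only the prefill role takes a bootstrap port."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["sglang", "serve", *SHARED_FLAGS,
           "--disaggregation-mode", mode,
           "--disaggregation-ib-device", ",".join(ib_devices),
           "--host", "0.0.0.0", "--port", str(port)]
    if mode == "prefill":
        if bootstrap_port is None:
            raise ValueError("prefill needs a bootstrap port")
        cmd += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return cmd


def router_cmd(prefill_url: str, bootstrap_port: int, decode_url: str,
               port: int = 8000) -> List[str]:
    """argv for the PD router fronting one prefill and one decode server."""
    return ["python3", "-m", "sglang_router.launch_router", "--pd-disaggregation",
            "--prefill", prefill_url, str(bootstrap_port),
            "--decode", decode_url,
            "--policy", "round_robin",
            "--host", "0.0.0.0", "--port", str(port)]


nics = [f"mlx5_{i}" for i in range(8)]
prefill = server_cmd("prefill", 30000, nics, bootstrap_port=8998)
decode = server_cmd("decode", 30001, nics)
router = router_cmd("http://prefill-host:30000", 8998, "http://decode-host:30001")
for argv in (prefill, decode, router):
    print(shlex.join(argv))
```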

Clients hit the router exactly like a single server — it splits each request across the two stages transparently:

<Accordion title="PD Client Example (Python)">
```python
from openai import OpenAI

client = OpenAI(base_url="http://<router-host>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="poolside/Laguna-M.1",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Output Example:

```text
2 + 2 = 4
```
</Accordion>

Transfer backend — mooncake (recommended). mooncake honors --disaggregation-ib-device and establishes its RDMA connection at registration, so the first request is already fast (no cold start). It works with a single NIC or all eight; using all 8 NICs lowers TTFT (more aggregate bandwidth for the KV payload — the gap widens at longer context). On 8×H200 (random isl=512 / osl=256, 16 concurrent) it served ≈ 717 tok/s output (≈ 2.2k tok/s total), mean TTFT 244 ms, mean TPOT 17.7 ms; with a single mlx5_0 NIC, ≈ 697 tok/s and TTFT 287 ms (TPOT unchanged — decode is compute-bound).

Transfer backend — NIXL (works, with two caveats).

<Warning> The NIXL path **ignores `--disaggregation-ib-device`** — that flag is mooncake-only. NIXL uses its UCX backend, whose NIC is selected by the **`UCX_NET_DEVICES`** environment variable. **Set it** (e.g. `export UCX_NET_DEVICES=mlx5_0:1`) on both servers; without it UCX cannot establish a working cross-node path and every KV transfer hangs until it hits the 300 s timeout (`Request … timed out … in KVPoll.WaitingForInput`) and returns a 500. </Warning>

With UCX_NET_DEVICES pinned, NIXL matches mooncake on quality and steady-state speed (≈ 720 tok/s, TTFT 230 ms, TPOT 17.7 ms). One difference: the first request after launch pays a ~38 s one-time UCX connection cold-start (a single port or all eight behave the same). Warm the path with one throwaway request after startup, or raise SGLANG_DISAGGREGATION_WAITING_TIMEOUT (default 300 s) so the first real request isn't dropped while UCX connects.
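
The throwaway warm-up request can be scripted so it runs right after the servers come up. The following is a sketch using only the standard library; `warm_up` and `post_json` are hypothetical helper names, the router URL is a placeholder, and the 120 s client timeout is an assumed value chosen to sit above the ~38 s cold start.

```python
import json
import time
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


def warm_up(router_url: str, model: str,
            post: Callable[[str, dict, float], dict] = post_json,
            timeout_s: float = 120.0) -> float:
    """Send one tiny request through the router; return elapsed seconds."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,  # the answer is discarded; only the KV transfer matters
    }
    started = time.monotonic()
    post(router_url.rstrip("/") + "/v1/chat/completions", payload, timeout_s)
    return time.monotonic() - started


# Usage once prefill, decode, and router are running (not executed here):
#   seconds = warm_up("http://<router-host>:8000", "poolside/Laguna-M.1")
#   print(f"first request took {seconds:.1f} s")
```

The `post` parameter is injectable so the helper can be exercised without a live cluster, for example in a deployment script's dry run.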

Validation. PD disaggregation preserves output quality — disaggregated output matches non-disaggregated serving, and GSM8K (no-thinking, 200-question subset via the router) scored 0.945 (mooncake, 8 NICs) / 0.940 (NIXL) / 0.950 (mooncake, 1 NIC), all with 100% stop-rate and 0% errors — in line with single-node BF16 (≈ 0.93 on the full split). Logs confirm the split: the prefill node logs Prefill batch (CUDA graph off), the decode node logs Decode batch (CUDA graph on).
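
To judge whether the three GSM8K scores above differ meaningfully, it helps to estimate the sampling noise of a 200-question subset. The sketch below uses a normal approximation to the binomial, which is a simplifying assumption and only rough at accuracies this close to 1; it is not how the scores themselves were produced.

```python
import math
from typing import Tuple


def accuracy_interval(acc: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for an accuracy measured on n questions."""
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error


# Scores reported above, each on a 200-question subset.
RUNS = {"mooncake, 8 NICs": 0.945, "NIXL": 0.940, "mooncake, 1 NIC": 0.950}

for name, acc in RUNS.items():
    low, high = accuracy_interval(acc, 200)
    print(f"{name}: {acc:.3f} (roughly {low:.3f} to {high:.3f})")
```

Each interval spans roughly ±0.03, so the spread between the three PD runs and the single-node figure of about 0.93 is within what subset noise alone would produce.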