docs_new/cookbook/autoregressive/Poolside/Laguna-XS.2.mdx
Laguna-XS.2 is an open-source hybrid sliding-window-attention MoE model from Poolside, built for agentic coding and long-horizon software engineering work.
Key Features:
- `<think>...</think>` segments toggled per request via `chat_template_kwargs={"enable_thinking": ...}`.

Available Quantizations:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "20%"}} /> <col style={{width: "80%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.02)"}}>Variant</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, backgroundColor: "rgba(255,255,255,0.05)"}}>Hugging Face path</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>BF16</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[`poolside/Laguna-XS.2`](https://huggingface.co/poolside/Laguna-XS.2)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>FP8</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[`poolside/Laguna-XS.2-FP8`](https://huggingface.co/poolside/Laguna-XS.2-FP8)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}><strong>NVFP4</strong></td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[`poolside/Laguna-XS.2-NVFP4`](https://huggingface.co/poolside/Laguna-XS.2-NVFP4)</td> </tr> </tbody> </table>

License: Apache 2.0
For details, see the Hugging Face model card and the Laguna deeper-dive blog post.
Laguna-XS.2 support is on main but not yet in a tagged release; install from the SGLang nightly wheel index, or pull a pre-built Docker image:
# Install SGLang via pip (CUDA 13) — requires Python 3.10 (nightly wheels are cp310 only)
python3 -m pip install --upgrade pip
python3 -m pip install --extra-index-url https://docs.sglang.ai/whl/cu130 \
"sglang[all]==0.5.12.dev20260509+g096ad02b0"
# CUDA 12: swap to the cu129 index
python3 -m pip install --extra-index-url https://docs.sglang.ai/whl/cu129 \
"sglang[all]==0.5.12.dev20260509+g096ad02b0"
# Or use Docker (multi-arch amd64/arm64)
docker pull lmsysorg/sglang:dev-cu13-laguna-xs2 # CUDA 13 (H200 / B200)
docker pull lmsysorg/sglang:dev-cu12-laguna-xs2 # CUDA 12 (H200)
For the full Docker setup and other installation methods, please refer to the official SGLang installation guide.
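Because the nightly wheels are tagged cp310, it can help to confirm the interpreter version before running pip. The check below is a minimal sketch; `is_supported_python` is a hypothetical helper name and not part of SGLang.

```python
import sys

# Hypothetical helper (not part of SGLang): the nightly wheels above are
# tagged cp310, so only a Python 3.10 interpreter can install them.
REQUIRED = (3, 10)


def is_supported_python(version=None):
    """Return True if the (major, minor) version matches the cp310 wheels."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) == REQUIRED


if __name__ == "__main__":
    found = ".".join(map(str, sys.version_info[:2]))
    status = "OK" if is_supported_python() else "unsupported"
    print(f"Python {found}: {status} for the cp310 nightly wheels")
```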
Interactive Command Generator: Use the configuration selector below to generate a launch command for your hardware.
import { LagunaXS2Deployment } from '/src/snippets/autoregressive/laguna-xs2-deployment.jsx';
<LagunaXS2Deployment />

- DeepGEMM precompilation: run `python3 -m sglang.compile_deep_gemm --model poolside/Laguna-XS.2-FP8` to avoid paying the compilation cost on every restart.
- Reasoning parser (`--reasoning-parser poolside_v1`): Splits `<think>...</think>` segments into `reasoning_content` so `content` holds only the final answer. Disable only if you want the raw `<think>` tags in `content`.
- Tool-call parser (`--tool-call-parser poolside_v1`): Required for OpenAI-compatible tool-call streaming. Disable only for chat-only deployments.
- DP attention: `--dp <N> --enable-dp-attention` with `--dp` matching `--tp` (tune independently if needed).
- Thinking: opt in per request with `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`.

The samples below assume the server is reachable at http://localhost:30000/v1.
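Before running the samples, it may be worth confirming that the server actually answers at that address. The sketch below uses only the standard library and assumes the server exposes the OpenAI-style `/v1/models` listing endpoint; `list_served_models` and `parse_model_ids` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:30000/v1"  # same address the samples assume


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /models response body."""
    return [m["id"] for m in payload.get("data", [])]


def list_served_models(base_url=BASE_URL, timeout=5.0):
    """Return served model ids, or None if the server is not reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as r:
            return parse_model_ids(json.load(r))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body.
        return None


if __name__ == "__main__":
    models = list_served_models()
    print("server not reachable" if models is None else f"serving: {models}")
```

If the call returns `None`, the server is probably still loading weights or listening on a different port.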
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
resp = client.chat.completions.create(
    model="poolside/Laguna-XS.2",
    messages=[
        {"role": "user", "content": "What is the difference between TCP and UDP?"}
    ],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
Output Example:
TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are two core protocols of the Internet Protocol (IP) suite, both used for network communication but with key differences:
## Connection Handling
- **TCP**: Connection-oriented protocol that establishes a connection before data transfer (like a phone call)
- **UDP**: Connectionless protocol that sends data without establishing a connection (like sending a letter)
## Reliability
- **TCP**: Guaranteed delivery with error checking, retransmission of lost packets, and flow control
- **UDP**: No guarantee of delivery; packets may be lost, duplicated, or arrive out of order
## Speed & Overhead
- **TCP**: Slower due to connection setup, acknowledgment overhead, and error correction mechanisms
- **UDP**: Faster with minimal overhead since it doesn't wait for acknowledgments or retransmit lost data
## Use Cases
- **TCP**: Web browsing (HTTP/HTTPS), email (SMTP), file transfers (FTP), database connections
- **UDP**: Video streaming, online gaming, VoIP calls, DNS queries, live broadcasts
In essence, TCP prioritizes reliability over speed, while UDP prioritizes speed over reliability.
Laguna-XS.2 emits reasoning between `<think>...</think>` tags. The `--reasoning-parser poolside_v1` flag separates the thinking text into `reasoning_content` so `content` holds only the final answer. Thinking is opt-in per request:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
resp = client.chat.completions.create(
    model="poolside/Laguna-XS.2",
    messages=[
        {"role": "user", "content": "If a train travels at 60 km/h for 2.5 hours, how far does it go?"}
    ],
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print("====== Reasoning Content ======")
print(resp.choices[0].message.reasoning_content)
print("====== Answer ======")
print(resp.choices[0].message.content)
Output Example:
====== Reasoning Content ======
The user is asking a straightforward math problem about distance, speed, and time. I need to calculate the distance using the formula:
Distance = Speed × Time
Given:
- Speed = 60 km/h
- Time = 2.5 hours
So the calculation would be:
Distance = 60 × 2.5 = 150 km
This is a simple multiplication problem. I should provide a clear, direct answer and maybe explain the calculation briefly.
====== Answer ======
To find the distance, use the formula:
Distance = Speed × Time
Distance = 60 km/h × 2.5 h = 150 km
The train travels **150 kilometers**.
To disable thinking, omit `extra_body` (thinking is off by default) or pass `chat_template_kwargs={"enable_thinking": False}` explicitly.
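To make the split between the two fields concrete, the sketch below mimics on a raw string what the reasoning parser is described as doing: text inside `<think>...</think>` goes to a reasoning field and the remainder becomes the answer. It is an illustration only, not the `poolside_v1` implementation, and `split_thinking` is a hypothetical helper.

```python
import re

# Illustrative only: not the poolside_v1 parser. Matches one <think> block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw):
    """Return (reasoning_content, content) from raw model text."""
    match = THINK_RE.search(raw)
    if match is None:
        # No <think> block (thinking disabled): everything is the answer.
        return None, raw.strip()
    reasoning = match.group(1).strip()
    content = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, content


raw = "<think>Distance = 60 × 2.5 = 150 km</think>The train travels 150 km."
reasoning, content = split_thinking(raw)
print(reasoning)  # Distance = 60 × 2.5 = 150 km
print(content)    # The train travels 150 km.
```

With the parser disabled on the server, a client would see the raw form and could apply a split like this itself.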
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                },
                "required": ["location"],
            },
        },
    }
]
resp = client.chat.completions.create(
    model="poolside/Laguna-XS.2",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
msg = resp.choices[0].message
print("====== Reasoning Content ======")
print(msg.reasoning_content)
print("====== Content ======")
print(msg.content)
print("====== Tool Calls ======")
for tc in msg.tool_calls or []:
    print(f" Function: {tc.function.name}")
    print(f" Arguments: {tc.function.arguments}")
Output Example:
====== Reasoning Content ======
None
====== Content ======
I'll check the current weather in Tokyo for you.
====== Tool Calls ======
Function: get_weather
Arguments: {"location": "Tokyo"}
`reasoning_content` is `None` because thinking is off by default; `content` carries the brief assistant message that precedes the tool call. Add `extra_body={"chat_template_kwargs": {"enable_thinking": True}}` if you want interleaved reasoning before the tool call.
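The sample stops at printing the tool call. A possible next step, sketched here, is to run the requested function locally and build the `tool` message that gets appended to the conversation before calling the API again. `get_weather` is a stub with placeholder data, and `run_tool_call` and `TOOL_REGISTRY` are hypothetical names; the message shape follows the OpenAI chat format.

```python
import json


def get_weather(location):
    """Stub implementation; a real one would query a weather service."""
    return {"location": location, "forecast": "placeholder"}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id, name, arguments_json):
    """Execute one tool call and return the follow-up `tool` message."""
    args = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**args)
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result),
    }


# With a live response this would be:
#   run_tool_call(tc.id, tc.function.name, tc.function.arguments)
message = run_tool_call("call_0", "get_weather", '{"location": "Tokyo"}')
print(message)
```

Appending the assistant message and this `tool` message to `messages`, then calling `client.chat.completions.create` again, lets the model produce its final answer from the tool result.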
Test Environment:
- Model: `poolside/Laguna-XS.2` (BF16)
- SGLang version: `0.5.12.dev20260509+g096ad02b0` (nightly wheel containing the #24204 merge commit; same code path as the original PR runs)
- Reasoning parser: `poolside_v1`
- Tool-call parser: `poolside_v1`
- Sampling: `temperature=0.6`, `max_tokens=16384`, `chat_template_kwargs={"enable_thinking": true}`, `n_repeats=1`
- Grading: `math_verify` (math) and `eval_mcq` (multichoice)

Results (from PR #24204):
| Eval | Accuracy |
|---|---|
| GPQA Diamond | 0.5556 |
| AIME 25 | 0.5667 |
| MMLU | 0.836 |
| SWE-Bench Verified | 0.6540 |
Test Environment:
- Model: `poolside/Laguna-XS.2` (BF16)
- SGLang version: `0.5.12.dev20260509+g096ad02b0` (nightly wheel containing the #24204 merge commit; same code path as the original PR runs)
- Benchmark tool: `sglang.bench_serving --backend sglang --dataset-name random` (defaults: `--random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.0`)

python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 10 --max-concurrency 1
| Metric | TP=1 | TP=4 |
|---|---|---|
| Successful requests | 10 | 10 |
| Output token throughput (tok/s) | 193.10 | 238.88 |
| Total token throughput (tok/s) | 471.82 | 583.68 |
| Mean TTFT (ms) | 35.32 | 24.17 |
| Mean TPOT (ms) | 5.10 | 4.13 |
| Median ITL (ms) | 5.14 | 4.14 |
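As a rough consistency check on the concurrency-1 numbers: with a single request in flight, output throughput should sit close to the reciprocal of mean TPOT, and slightly below it because TTFT also consumes wall-clock time. The sketch only redoes that arithmetic on the reported values; treating `1000 / TPOT` as an approximate ceiling is an assumption of this check, not a claim from the benchmark tool.

```python
# Reported values from the concurrency-1 table.
mean_tpot_ms = {"TP=1": 5.10, "TP=4": 4.13}
measured_tok_s = {"TP=1": 193.10, "TP=4": 238.88}


def decode_ceiling_tok_s(tpot_ms):
    """Approximate single-stream tokens/s implied by mean TPOT alone."""
    return 1000.0 / tpot_ms


for tp, tpot in mean_tpot_ms.items():
    ceiling = decode_ceiling_tok_s(tpot)
    print(f"{tp}: ceiling {ceiling:.1f} tok/s, measured {measured_tok_s[tp]:.1f} tok/s")
```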
python3 -m sglang.bench_serving --backend sglang \
--host 0.0.0.0 --port 30000 \
--dataset-name random --num-prompts 1000 --max-concurrency 100
| Metric | TP=1 | TP=4 |
|---|---|---|
| Successful requests | 1000 | 1000 |
| Request throughput (req/s) | 7.32 | 14.61 |
| Output token throughput (tok/s) | 3739.30 | 7465.18 |
| Peak output token throughput (tok/s) | 4718.00 | 10133.00 |
| Total token throughput (tok/s) | 7485.82 | 14944.81 |
| Mean TTFT (ms) | 115.17 | 68.36 |
| Mean TPOT (ms) | 25.51 | 12.71 |
| Median ITL (ms) | 21.31 | 10.64 |
TP=4 delivers roughly 2.0× the total-token throughput and ~1.7× lower mean TTFT than TP=1 on the random workload at `--max-concurrency 100`.
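The ratios quoted here can be recomputed directly from the concurrency-100 table. This short sketch does only that arithmetic; the dictionary keys are illustrative names, not fields emitted by `sglang.bench_serving`.

```python
# Reported values from the concurrency-100 table.
tp1 = {"total_tok_s": 7485.82, "mean_ttft_ms": 115.17, "mean_tpot_ms": 25.51}
tp4 = {"total_tok_s": 14944.81, "mean_ttft_ms": 68.36, "mean_tpot_ms": 12.71}

throughput_gain = tp4["total_tok_s"] / tp1["total_tok_s"]   # higher is better
ttft_reduction = tp1["mean_ttft_ms"] / tp4["mean_ttft_ms"]  # lower is better
tpot_reduction = tp1["mean_tpot_ms"] / tp4["mean_tpot_ms"]  # lower is better

print(f"total throughput: {throughput_gain:.2f}x")
print(f"mean TTFT:        {ttft_reduction:.2f}x lower")
print(f"mean TPOT:        {tpot_reduction:.2f}x lower")
```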