# Go SGLang Router

`sgl-model-gateway/bindings/golang/examples/oai_server`

Go SGLang Router is a high-performance, OpenAI-compatible API server that communicates with the SGLang backend over gRPC and performs efficient preprocessing and postprocessing through Rust FFI.

> **Note:** gRPC mode still calls into FFI. gRPC is used only for communication with the SGLang backend, while input/output processing relies entirely on Rust FFI.

## Architecture
```
HTTP Client (OpenAI API format)
            │
            ▼
FastHTTP Server: handlers/chat.go:HandleChatCompletion
  - Parse request JSON
  - SetBodyStreamWriter (SSE)
            │
            ▼
SGLang Client (client.go): CreateChatCompletionStream(ctx, req)
  - Wraps the gRPC client
            │
            ▼
gRPC Client (internal/grpc/client_grpc.go):
CreateChatCompletionStream(ctx, reqJSON)

  Step 1: FFI preprocess (Rust FFI)
    - ffi.PreprocessChatRequestWithTokenizer()
    - chat_template application
    - tokenization
    - tool-constraint generation
    Returns: PromptText, TokenIDs, ToolConstraintsJSON, PromptTokens

  Step 2: Build the gRPC request
    - Parse request JSON (model, temperature, etc.)
    - Create proto.GenerateRequest
    - Set TokenizedInput (PromptText, TokenIDs)
    - Set SamplingParams (temperature, top_p, top_k, etc.)
    - Set Constraints (from ToolConstraintsJSON)

  Step 3: Create the gRPC stream
    - client.Generate(generateReq) → gRPC stream
    - Connects to the SGLang backend (Rust)

  Step 4: Create converter & batch postprocessor
    - ffi.CreateGrpcResponseConverterWithTokenizer()
      (uses preprocessed.PromptTokens for the initial count)
    - ffi.NewBatchPostprocessor(batchSize=1, immediate)

  Step 5: Start readLoop (background goroutine)
    - go grpcStream.readLoop()
    - Returns GrpcChatCompletionStream immediately
            │
            ▼
GrpcChatCompletionStream.readLoop() (background goroutine)

  Recv() goroutine (dedicated)
    - Continuously calls stream.Recv()
    - Sends results to recvChan (buffered, 2000)
    - Exits on ctx.Done() or error
    - Calls stream.CloseSend() on ctx.Done()

  Main loop
    - Reads from recvChan
    - For each proto.GenerateResponse:
        go processAndSendResponse() (async)
          - protoToJSON() converts the proto to a JSON string
          - batchPostprocessor.AddChunk(protoJSON)
            → FFI postprocessing (token decoding, tool parsing)
            → returns OpenAI-format JSON strings
          - Sends JSON to resultJSONChan (buffered, 10000)
    - All operations check ctx.Done() for cancellation
    - On EOF: flush the batch, send remaining results, return
    - On error: send to errChan (buffered, 100)
    - defer: cancel ctx, wait for goroutines, close channels
            │
            ▼
resultJSONChan (buffered channel, 10000)
  - Holds OpenAI-format JSON strings ready for consumption
            │
            ▼
ChatCompletionStream.RecvJSON() (client.go:410)
  - Direct wrapper: return grpcStream.RecvJSON()
  - No intermediate processing
            │
            ▼
FastHTTP SetBodyStreamWriter (handlers/chat.go:159)
  - Loop: stream.RecvJSON() → format SSE → flush
  - Format: "data: {json}\n\n"
  - Final:  "data: [DONE]\n\n"
  - Immediate flush after each chunk
            │
            ▼
HTTP Client (SSE stream)
  Receives: data: {...}\n\n
```
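The `readLoop` structure above can be reduced to a short sketch. This is an illustrative miniature, not the real implementation: `runPipeline` and its fake responses are made up here, and the real code additionally guards every channel send with `case <-s.ctx.Done()`, which the buffered channels make unnecessary in this toy.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// runPipeline is a miniature of GrpcChatCompletionStream.readLoop:
// a dedicated receiver goroutine feeds recvChan, each item is handed to
// an async worker (a processAndSendResponse stand-in), and the finished
// JSON strings land in resultJSONChan for the consumer.
func runPipeline(responses []string) []string {
	_, cancel := context.WithCancel(context.Background())
	recvChan := make(chan string, 2000)        // backend → readLoop
	resultJSONChan := make(chan string, 10000) // readLoop → consumer

	// Dedicated Recv() goroutine: the real code loops on stream.Recv();
	// here it just replays canned responses and closes on "EOF".
	go func() {
		defer close(recvChan)
		for _, r := range responses {
			recvChan <- r
		}
	}()

	var processWg sync.WaitGroup
	go func() {
		// defer order mirrors the doc: cancel the context, wait for
		// all workers, then close the output channel.
		defer func() {
			cancel()
			processWg.Wait()
			close(resultJSONChan)
		}()
		for raw := range recvChan {
			processWg.Add(1)
			go func(r string) { // async postprocess stand-in
				defer processWg.Done()
				resultJSONChan <- `{"choices":[{"delta":{"content":"` + r + `"}}]}`
			}(raw)
		}
	}()

	var out []string
	for j := range resultJSONChan {
		out = append(out, j)
	}
	return out
}

func main() {
	for _, j := range runPipeline([]string{"Hel", "lo"}) {
		fmt.Println(j)
	}
}
```

Closing `resultJSONChan` only after `processWg.Wait()` is what lets the consumer use a plain `range` loop without risking a send on a closed channel.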
## Running

```bash
./run.sh
```

The server starts on port `:8080`.

### Example request

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
## Thread Safety & Cancellation

- The `TokenizerHandle` is created once at startup (`Arc<dyn TokenizerTrait>` on the Rust side, thread-safe).
- Cancellation uses Go's `context.Context` mechanism.
- `readLoop`'s `defer` cancels the context first, then waits for all goroutines to complete, and finally closes the channels.
- `processAndSendResponse` checks `ctx.Done()` at function start, and all `select` statements include `case <-s.ctx.Done()`.
- The `Recv()` goroutine feeds `recvChan`; `CloseSend()` is called when the context is cancelled so that `Recv()` returns an error.

## Channels

- `resultJSONChan`: main data channel (gRPC layer)
- `errChan`: error channel (gRPC layer)
- `recvChan`: internal communication channel (gRPC layer)

Buffer sizes and timeouts are configurable:

```go
type ChannelBufferSizes struct {
	ResultJSONChan int // Default: 10000
	ErrChan        int // Default: 100
	RecvChan       int // Default: 2000
}

type Timeouts struct {
	KeepaliveTime    time.Duration // Default: 300s
	KeepaliveTimeout time.Duration // Default: 20s
	CloseTimeout     time.Duration // Default: 5s
}
```
## Performance Notes

- `RecvJSON()` returns raw JSON strings, avoiding parse/serialize overhead.
- `readLoop` processes responses in the background and does not block request handling.

## Project Layout

```
sgl-model-gateway/bindings/golang/
├── client.go                  # High-level client API
├── internal/
│   ├── grpc/
│   │   └── client_grpc.go     # gRPC client implementation
│   ├── ffi/                   # FFI bindings (Rust)
│   └── proto/                 # Protobuf definitions
└── examples/
    └── oai_server/
        ├── handlers/
        │   └── chat.go        # HTTP request handling
        ├── models/
        │   └── chat.go        # Request/response models
        └── service/
            └── sglang_service.go  # Service layer
```
## Graceful Shutdown

A client disconnect propagates through the stack:

1. `SetBodyStreamWriter` detects the flush error.
2. `readLoop` detects `ctx.Done()`.
3. The `Recv()` goroutine returns an error.
4. The `closed` flag is set, in-flight `processAndSendResponse` goroutines are awaited (`processWg.Wait()`), and the channels (`resultJSONChan`, `errChan`, `readLoopDone`) are closed.

All `select` statements include `case <-s.ctx.Done()`, and `readLoop`'s `defer` uses `processWg.Wait()` to ensure every goroutine completes before the channels are closed.

Location: `internal/grpc/client_grpc.go:108`

### `readLoop`

Location: `internal/grpc/client_grpc.go:290`

- Runs a dedicated receive goroutine around `stream.Recv()`.
- Dispatches `processAndSendResponse` goroutines (tracked with `processWg`).
- On shutdown: sets the `closed` flag, waits for the `processAndSendResponse` goroutines to complete (`processWg.Wait()`), then closes `resultJSONChan`, `errChan`, and `readLoopDone`.

### `processAndSendResponse`

Location: `internal/grpc/client_grpc.go:379`

- Checks `ctx.Done()` at function start and returns immediately if cancelled.
- All `select` statements include `case <-s.ctx.Done()` for graceful shutdown handling.

### `RecvJSON`

Locations:

- `internal/grpc/client_grpc.go:412`: gRPC-layer implementation, reading from `resultJSONChan`
- `client.go:410`: client wrapper layer