SGLang provides a direct inference engine without the need for an HTTP server. There are generally these use cases:
## Offline Batch Inference

In this example, we launch an SGLang engine and feed it a batch of inputs for inference. Even if you provide a very large batch, the engine schedules the requests intelligently so they are processed efficiently without out-of-memory (OOM) errors.
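A minimal sketch of what such a script can look like, assuming the `sgl.Engine` constructor with a `model_path` argument and a plain dict of sampling parameters (the model name here is a placeholder, not a requirement):

```python
# Sketch: offline batch inference with the SGLang Engine.
# The model path is a placeholder; substitute any model you have access to.
import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    # Launch the engine in-process; no HTTP server is started.
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

    # The engine batches and schedules these requests internally.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt!r}\nGenerated: {output['text']!r}\n")

    llm.shutdown()


if __name__ == "__main__":
    main()
```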
## Embedding Generation

In this example, we launch an SGLang engine and feed it a batch of inputs for embedding generation.
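A sketch along the same lines, assuming the engine accepts an `is_embedding` flag and exposes an `encode` method whose results carry an `"embedding"` key (the model name is again a placeholder):

```python
# Sketch: batch embedding generation with the SGLang Engine.
# Assumes an embedding-capable model and the is_embedding flag.
import sglang as sgl


def main():
    texts = [
        "The Transformer architecture is",
        "SGLang is a fast serving framework.",
    ]

    llm = sgl.Engine(model_path="intfloat/e5-mistral-7b-instruct", is_embedding=True)

    # encode() returns one result per input text.
    outputs = llm.encode(texts)
    for text, output in zip(texts, outputs):
        embedding = output["embedding"]  # assumed output key
        print(f"{text!r} -> vector of dim {len(embedding)}")

    llm.shutdown()


if __name__ == "__main__":
    main()
```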
## Custom Server on Top of the Engine

This example demonstrates how to build a custom server on top of the SGLang Engine, using Sanic as the web framework. The server supports both non-streaming and streaming endpoints; a stripped-down sketch of the server appears at the end of this section.
Install Sanic:

```bash
pip install sanic
```
Run the server:

```bash
python custom_server.py
```
Send requests:

```bash
curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "The Transformer architecture is..."}'

curl -X POST http://localhost:8000/generate_stream -H "Content-Type: application/json" -d '{"prompt": "The Transformer architecture is..."}' --no-buffer
```
This will send both non-streaming and streaming requests to the server.
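For reference, here is the promised sketch of such a Sanic server. It assumes the engine exposes an awaitable `async_generate` that, with `stream=True`, yields partial results as an async generator; exact argument and output-key names may differ across SGLang versions, and the model path is a placeholder:

```python
# Sketch: a Sanic server wrapping the SGLang Engine.
# Endpoint names match the curl examples above.
import sglang as sgl
from sanic import Sanic
from sanic.response import json as json_response

app = Sanic("sglang_server")
engine = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder


@app.post("/generate")
async def generate(request):
    prompt = request.json["prompt"]
    result = await engine.async_generate(prompt, {"temperature": 0.8})
    return json_response({"text": result["text"]})


@app.post("/generate_stream")
async def generate_stream(request):
    prompt = request.json["prompt"]
    response = await request.respond(content_type="text/plain")
    # With stream=True, async_generate yields partial outputs; whether
    # chunk["text"] is cumulative or a delta depends on the SGLang version.
    async for chunk in await engine.async_generate(prompt, {"temperature": 0.8}, stream=True):
        await response.send(chunk["text"])
    await response.eof()


if __name__ == "__main__":
    # single_process avoids forking workers, since the engine lives in this process.
    app.run(host="0.0.0.0", port=8000, single_process=True)
```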
## Token-In-Token-Out

In this example, we launch an SGLang engine, feed token IDs as input, and receive generated token IDs as output, keeping tokenization outside the engine.
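A hedged sketch of this flow: it assumes the engine accepts a `skip_tokenizer_init` flag to disable its own tokenizer, that `generate` takes an `input_ids` argument, and that generated IDs appear under an `"output_ids"` key (these names are assumptions; a Hugging Face tokenizer stands in for whatever tokenization you do outside the engine):

```python
# Sketch: token-in, token-out generation.
# Tokenization happens outside the engine.
import sglang as sgl
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder


def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    prompts = ["The capital of France is"]
    input_ids = [tokenizer.encode(p) for p in prompts]

    llm = sgl.Engine(model_path=MODEL, skip_tokenizer_init=True)
    outputs = llm.generate(input_ids=input_ids, sampling_params={"temperature": 0.8})

    for ids, out in zip(input_ids, outputs):
        generated = out["output_ids"]  # assumed key for the generated token ids
        print("prompt ids:   ", ids)
        print("generated ids:", generated)
        print("decoded:      ", tokenizer.decode(generated))

    llm.shutdown()


if __name__ == "__main__":
    main()
```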
## FastAPI Server

This example demonstrates how to create a FastAPI server that uses the SGLang engine for text generation.
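A minimal FastAPI sketch under the same assumptions about `async_generate` as the Sanic example above (model path and sampling parameters are placeholders):

```python
# Sketch: a FastAPI server backed by the SGLang Engine.
import sglang as sgl
import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
engine = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder


class GenerateRequest(BaseModel):
    prompt: str


@app.post("/generate")
async def generate(req: GenerateRequest):
    result = await engine.async_generate(req.prompt, {"temperature": 0.8})
    return {"text": result["text"]}


@app.post("/generate_stream")
async def generate_stream(req: GenerateRequest):
    async def stream():
        # Same caveat as the Sanic sketch: the streaming contract (cumulative
        # text vs. deltas) may vary by SGLang version.
        async for chunk in await engine.async_generate(req.prompt, {"temperature": 0.8}, stream=True):
            yield chunk["text"]

    return StreamingResponse(stream(), media_type="text/plain")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```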