docs_new/docs/basic_usage/ollama_api.mdx
SGLang provides Ollama API compatibility, allowing you to use the Ollama CLI and Python library with SGLang as the inference backend.
<Note>You don't need the Ollama server installed; SGLang acts as the backend. You only need the Ollama CLI or Python library as the client.</Note>
<Note>The model name used with `ollama run` must match exactly what you passed to `--model` when launching the SGLang server.</Note>
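If the server is not already running, a launch typically looks like the sketch below. The flags shown (`--model-path`, `--port`) come from the standard `sglang.launch_server` entry point; exact flag names can differ between SGLang versions, so check `python -m sglang.launch_server --help`.

<CodeGroup>
```bash Command
# Sketch: launch SGLang on port 30001 (flag names may vary by version)
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-1.5B-Instruct \
  --port 30001
```
</CodeGroup>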
Then point the Ollama CLI at the SGLang server by setting `OLLAMA_HOST`:

<CodeGroup>
```bash Command
OLLAMA_HOST=http://localhost:30001 ollama run "Qwen/Qwen2.5-1.5B-Instruct"
```
</CodeGroup>
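The Ollama CLI and Python library are thin wrappers over Ollama's REST API, so you can also sanity-check the server with a raw request. The endpoint and payload below follow the standard Ollama API; this sketch assumes SGLang exposes the chat endpoint at the usual path.

<CodeGroup>
```bash Command
# Standard Ollama chat endpoint (assumes SGLang serves it at the usual path)
curl http://localhost:30001/api/chat -d '{
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'
```
</CodeGroup>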
If connecting to a remote server behind a firewall:
<CodeGroup>
```bash Command
# SSH tunnel
ssh -L 30001:localhost:30001 user@gpu-server -N &
# Then use Ollama CLI as above
OLLAMA_HOST=http://localhost:30001 ollama list
```
</CodeGroup>
The Ollama Python library works the same way; create a client pointed at the SGLang server:

<CodeGroup>
```python Python
import ollama

client = ollama.Client(host='http://localhost:30001')

# Non-streaming
response = client.chat(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])

# Streaming
stream = client.chat(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    messages=[{'role': 'user', 'content': 'Tell me a story'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```
</CodeGroup>
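Generation parameters can be passed through the client's standard `options` dict. This is a sketch: the option names below (`temperature`, `num_predict`) follow the Ollama Python API, and which of them SGLang maps onto its own sampling parameters may vary.

<CodeGroup>
```python Python
import ollama

client = ollama.Client(host='http://localhost:30001')

# Ollama-style generation options; names follow the Ollama API
# (whether each one is honored depends on SGLang's option mapping)
response = client.chat(
    model='Qwen/Qwen2.5-1.5B-Instruct',
    messages=[{'role': 'user', 'content': 'Write a haiku about GPUs.'}],
    options={'temperature': 0.7, 'num_predict': 64},
)
print(response['message']['content'])
```
</CodeGroup>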
For intelligent routing between local Ollama (fast) and remote SGLang (powerful) using an LLM judge, see the Smart Router documentation.