# API Server

This is an HTTP API server for gemma.cpp that implements the Google API protocol, allowing you to interact with Gemma models through REST endpoints compatible with the Google API format.

## Features

- `generateContent` endpoint
- `streamGenerateContent` endpoint
- `/v1beta/models` endpoint

## Building

The API server is built alongside the main gemma.cpp project:
```sh
# Configure the build
cmake -B build -DCMAKE_BUILD_TYPE=Release

# Build the API server and client
cmake --build build --target gemma_api_server gemma_api_client -j 8
```
The binaries will be created at:

- `build/gemma_api_server` - Local API server
- `build/gemma_api_client` - Unified client for both local and public APIs

## Running the Server

```sh
./build/gemma_api_server \
  --tokenizer path/to/tokenizer.spm \
  --weights path/to/model.sbs \
  --port 8080
```
Required arguments:

- `--tokenizer`: Path to the tokenizer file (`.spm`)
- `--weights`: Path to the model weights file (`.sbs`)

Optional arguments:

- `--port`: Port to listen on (default: 8080)
- `--model`: Model name for API endpoints (default: gemma3-4b)

## Using the Client

```sh
# Interactive chat with local server
./build/gemma_api_client --interactive 1 --host localhost --port 8080

# Single prompt with local server
./build/gemma_api_client --prompt "Hello, how are you?"

# Set API key and use public API
export GOOGLE_API_KEY="your-api-key-here"
./build/gemma_api_client --interactive 1

# Or pass API key directly
./build/gemma_api_client --api_key "your-api-key" --interactive 1
```
## API Endpoints

The server implements the following Google API endpoints.

### POST `/v1beta/models/gemma3-4b:generateContent`

Generates a response for the given content (non-streaming).
Request:
{
"contents": [
{
"parts": [
{"text": "Why is the sky blue?"}
]
}
],
"generationConfig": {
"temperature": 0.9,
"topK": 1,
"maxOutputTokens": 1024
}
}
Response:

```json
{
  "candidates": [
    {
      "content": {
        "parts": [
          {"text": "The sky appears blue because..."}
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "index": 0
    }
  ],
  "promptFeedback": {
    "safetyRatings": []
  },
  "usageMetadata": {
    "promptTokenCount": 5,
    "candidatesTokenCount": 25,
    "totalTokenCount": 30
  }
}
```
### POST `/v1beta/models/gemma3-4b:streamGenerateContent`

Generates a response with Server-Sent Events (SSE) streaming.

Request: Same as above.

Response: A stream of SSE events:
```
data: {"candidates":[{"content":{"parts":[{"text":"The"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}
data: {"candidates":[{"content":{"parts":[{"text":" sky"}],"role":"model"},"index":0}],"promptFeedback":{"safetyRatings":[]}}
data: [DONE]
```
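A minimal Python consumer for this stream could look like the following. This is a sketch, not part of the server: it assumes only the standard library, the local server from above on port 8080, and `parse_sse_line`/`stream_generate` are helper names chosen here for illustration.

```python
import json
import urllib.request

def parse_sse_line(line):
    """Return the text delta carried by one 'data:' SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    return event["candidates"][0]["content"]["parts"][0]["text"]

def stream_generate(prompt, host="http://localhost:8080"):
    """POST to streamGenerateContent and print text deltas as they arrive."""
    url = host + "/v1beta/models/gemma3-4b:streamGenerateContent"
    body = json.dumps({"contents": [{"parts": [{"text": prompt}]}]}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # iterate response lines as they arrive
            text = parse_sse_line(raw.decode().strip())
            if text is not None:
                print(text, end="", flush=True)

# Usage (with the server running):
#   stream_generate("Why is the sky blue?")
```

The parsing is kept in a separate pure function so that buffering or reconnection logic can be added around it without touching the SSE decoding.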
### GET `/v1beta/models`

Lists available models.
Response:

```json
{
  "models": [
    {
      "name": "models/gemma3-4b",
      "displayName": "Gemma3 4B",
      "description": "Gemma3 4B model running locally"
    }
  ]
}
```
## Example Usage with curl

```sh
# Generate content (non-streaming)
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Hello, how are you?"}]}],
    "generationConfig": {"temperature": 0.9, "topK": 1, "maxOutputTokens": 1024}
  }'

# Stream generate content (SSE)
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:streamGenerateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Tell me a story"}]}],
    "generationConfig": {"temperature": 0.9, "topK": 1, "maxOutputTokens": 1024}
  }'

# List models
curl http://localhost:8080/v1beta/models
```
## Multi-turn Conversation

To continue a conversation, include the previous user and model turns, in order, in the `contents` array:

```sh
# First message
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"parts": [{"text": "Hi, my name is Alice"}]}
    ]
  }'

# Follow-up message with conversation history
curl -X POST http://localhost:8080/v1beta/models/gemma3-4b:generateContent \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {"parts": [{"text": "Hi, my name is Alice"}]},
      {"parts": [{"text": "Hello Alice! Nice to meet you."}]},
      {"parts": [{"text": "What is my name?"}]}
    ]
  }'
```
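The history-building step above can be sketched as a small helper. The function name `build_contents` is hypothetical (chosen here for illustration); the payload shape matches the requests shown:

```python
def build_contents(history, new_message):
    """Build a request body from prior turns plus a new user message.

    `history` is a list of message strings, alternating user and model,
    in the order they occurred.
    """
    contents = [{"parts": [{"text": msg}]} for msg in history]
    contents.append({"parts": [{"text": new_message}]})
    return {"contents": contents}

# Example: the follow-up request from the curl sample above.
payload = build_contents(
    ["Hi, my name is Alice", "Hello Alice! Nice to meet you."],
    "What is my name?",
)
```

The resulting dictionary can be passed directly as the JSON body of a `generateContent` request.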
## Python Example

```python
import requests

# Generate content
response = requests.post(
    'http://localhost:8080/v1beta/models/gemma3-4b:generateContent',
    json={
        'contents': [{'parts': [{'text': 'Explain quantum computing in simple terms'}]}],
        'generationConfig': {
            'temperature': 0.9,
            'topK': 1,
            'maxOutputTokens': 1024
        }
    }
)

result = response.json()
if 'candidates' in result and result['candidates']:
    text = result['candidates'][0]['content']['parts'][0]['text']
    print(text)
```
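Listing models from Python follows the same pattern. A sketch using only the standard library; `model_names` and `list_models` are helper names chosen here, not part of the server:

```python
import json
import urllib.request

def model_names(listing):
    """Pull the model names out of a /v1beta/models response body."""
    return [m["name"] for m in listing.get("models", [])]

def list_models(host="http://localhost:8080"):
    """Fetch the model list from the local server and return the names."""
    with urllib.request.urlopen(host + "/v1beta/models") as resp:
        return model_names(json.load(resp))

# Usage (with the server running):
#   print(list_models())
```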
## Generation Configuration

The Google API supports various generation configuration options; the examples in this README pass `temperature`, `topK`, and `maxOutputTokens` under `generationConfig`.

## Compatibility

This implementation is compatible with clients that speak the Google API request/response format, so such clients can be pointed at the local server.